SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER
Min. :100002 Min. :0.00000 Length:307511 Length:307511
1st Qu.:189146 1st Qu.:0.00000 Class :character Class :character
Median :278202 Median :0.00000 Mode :character Mode :character
Mean :278181 Mean :0.08073
3rd Qu.:367143 3rd Qu.:0.00000
Max. :456255 Max. :1.00000
FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL
Length:307511 Length:307511 Min. : 0.0000 Min. : 25650
Class :character Class :character 1st Qu.: 0.0000 1st Qu.: 112500
Mode :character Mode :character Median : 0.0000 Median : 147150
Mean : 0.4171 Mean : 168798
3rd Qu.: 1.0000 3rd Qu.: 202500
Max. :19.0000 Max. :117000000
AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE
Min. : 45000 Min. : 1616 Min. : 40500 Length:307511
1st Qu.: 270000 1st Qu.: 16524 1st Qu.: 238500 Class :character
Median : 513531 Median : 24903 Median : 450000 Mode :character
Mean : 599026 Mean : 27109 Mean : 538396
3rd Qu.: 808650 3rd Qu.: 34596 3rd Qu.: 679500
Max. :4050000 Max. :258026 Max. :4050000
NA's :12 NA's :278
NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE
Length:307511 Length:307511 Length:307511 Length:307511
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION
Min. :0.00029 Min. :-25229 Min. :-17912 Min. :-24672
1st Qu.:0.01001 1st Qu.:-19682 1st Qu.: -2760 1st Qu.: -7480
Median :0.01885 Median :-15750 Median : -1213 Median : -4504
Mean :0.02087 Mean :-16037 Mean : 63815 Mean : -4986
3rd Qu.:0.02866 3rd Qu.:-12413 3rd Qu.: -289 3rd Qu.: -2010
Max. :0.07251 Max. : -7489 Max. :365243 Max. : 0
DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE
Min. :-7197 Min. : 0.00 Min. :0 Min. :0.0000
1st Qu.:-4299 1st Qu.: 5.00 1st Qu.:1 1st Qu.:1.0000
Median :-3254 Median : 9.00 Median :1 Median :1.0000
Mean :-2994 Mean :12.06 Mean :1 Mean :0.8199
3rd Qu.:-1720 3rd Qu.:15.00 3rd Qu.:1 3rd Qu.:1.0000
Max. : 0 Max. :91.00 Max. :1 Max. :1.0000
NA's :202929
FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:0.00000
Median :0.0000 Median :1.0000 Median :0.0000 Median :0.00000
Mean :0.1994 Mean :0.9981 Mean :0.2811 Mean :0.05672
3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.00000
Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000
OCCUPATION_TYPE CNT_FAM_MEMBERS REGION_RATING_CLIENT
Length:307511 Min. : 1.000 Min. :1.000
Class :character 1st Qu.: 2.000 1st Qu.:2.000
Mode :character Median : 2.000 Median :2.000
Mean : 2.153 Mean :2.052
3rd Qu.: 3.000 3rd Qu.:2.000
Max. :20.000 Max. :3.000
NA's :2
REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START
Min. :1.000 Length:307511 Min. : 0.00
1st Qu.:2.000 Class :character 1st Qu.:10.00
Median :2.000 Mode :character Median :12.00
Mean :2.032 Mean :12.06
3rd Qu.:2.000 3rd Qu.:14.00
Max. :3.000 Max. :23.00
REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION
Min. :0.00000 Min. :0.00000
1st Qu.:0.00000 1st Qu.:0.00000
Median :0.00000 Median :0.00000
Mean :0.01514 Mean :0.05077
3rd Qu.:0.00000 3rd Qu.:0.00000
Max. :1.00000 Max. :1.00000
LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY
Min. :0.00000 Min. :0.00000 Min. :0.0000
1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000
Median :0.00000 Median :0.00000 Median :0.0000
Mean :0.04066 Mean :0.07817 Mean :0.2305
3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.0000
Max. :1.00000 Max. :1.00000 Max. :1.0000
LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE EXT_SOURCE_1 EXT_SOURCE_2
Min. :0.0000 Length:307511 Min. :0.01 Min. :0.0000
1st Qu.:0.0000 Class :character 1st Qu.:0.33 1st Qu.:0.3925
Median :0.0000 Mode :character Median :0.51 Median :0.5660
Mean :0.1796 Mean :0.50 Mean :0.5144
3rd Qu.:0.0000 3rd Qu.:0.68 3rd Qu.:0.6636
Max. :1.0000 Max. :0.96 Max. :0.8550
NA's :173378 NA's :660
EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG
Min. :0.00 Min. :0.00 Min. :0.00 Min. :0.00
1st Qu.:0.37 1st Qu.:0.06 1st Qu.:0.04 1st Qu.:0.98
Median :0.54 Median :0.09 Median :0.08 Median :0.98
Mean :0.51 Mean :0.12 Mean :0.09 Mean :0.98
3rd Qu.:0.67 3rd Qu.:0.15 3rd Qu.:0.11 3rd Qu.:0.99
Max. :0.90 Max. :1.00 Max. :1.00 Max. :1.00
NA's :60965 NA's :156061 NA's :179943 NA's :150007
YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG
Min. :0.00 Min. :0.00 Min. :0.00 Min. :0.00
1st Qu.:0.69 1st Qu.:0.01 1st Qu.:0.00 1st Qu.:0.07
Median :0.76 Median :0.02 Median :0.00 Median :0.14
Mean :0.75 Mean :0.04 Mean :0.08 Mean :0.15
3rd Qu.:0.82 3rd Qu.:0.05 3rd Qu.:0.12 3rd Qu.:0.21
Max. :1.00 Max. :1.00 Max. :1.00 Max. :1.00
NA's :204488 NA's :214865 NA's :163891 NA's :154828
FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG
Min. :0.00 Min. :0.00 Min. :0.00 Min. :0.00
1st Qu.:0.17 1st Qu.:0.08 1st Qu.:0.02 1st Qu.:0.05
Median :0.17 Median :0.21 Median :0.05 Median :0.08
Mean :0.23 Mean :0.23 Mean :0.07 Mean :0.10
3rd Qu.:0.33 3rd Qu.:0.38 3rd Qu.:0.09 3rd Qu.:0.12
Max. :1.00 Max. :1.00 Max. :1.00 Max. :1.00
NA's :153020 NA's :208642 NA's :182590 NA's :210199
LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE
Min. :0.00 Min. :0.00 Min. :0.00 Min. :0.00
1st Qu.:0.05 1st Qu.:0.00 1st Qu.:0.00 1st Qu.:0.05
Median :0.07 Median :0.00 Median :0.00 Median :0.08
Mean :0.11 Mean :0.01 Mean :0.03 Mean :0.11
3rd Qu.:0.13 3rd Qu.:0.00 3rd Qu.:0.03 3rd Qu.:0.14
Max. :1.00 Max. :1.00 Max. :1.00 Max. :1.00
NA's :154350 NA's :213514 NA's :169682 NA's :156061
BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE
Min. :0.00 Min. :0.00 Min. :0.00
1st Qu.:0.04 1st Qu.:0.98 1st Qu.:0.70
Median :0.07 Median :0.98 Median :0.76
Mean :0.09 Mean :0.98 Mean :0.76
3rd Qu.:0.11 3rd Qu.:0.99 3rd Qu.:0.82
Max. :1.00 Max. :1.00 Max. :1.00
NA's :179943 NA's :150007 NA's :204488
COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE
Min. :0.00 Min. :0.00 Min. :0.00 Min. :0.00
1st Qu.:0.01 1st Qu.:0.00 1st Qu.:0.07 1st Qu.:0.17
Median :0.02 Median :0.00 Median :0.14 Median :0.17
Mean :0.04 Mean :0.07 Mean :0.15 Mean :0.22
3rd Qu.:0.05 3rd Qu.:0.12 3rd Qu.:0.21 3rd Qu.:0.33
Max. :1.00 Max. :1.00 Max. :1.00 Max. :1.00
NA's :214865 NA's :163891 NA's :154828 NA's :153020
FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE
Min. :0.00 Min. :0.00 Min. :0.00 Min. :0.00
1st Qu.:0.08 1st Qu.:0.02 1st Qu.:0.05 1st Qu.:0.04
Median :0.21 Median :0.05 Median :0.08 Median :0.07
Mean :0.23 Mean :0.06 Mean :0.11 Mean :0.11
3rd Qu.:0.38 3rd Qu.:0.08 3rd Qu.:0.13 3rd Qu.:0.13
Max. :1.00 Max. :1.00 Max. :1.00 Max. :1.00
NA's :208642 NA's :182590 NA's :210199 NA's :154350
NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI
Min. :0.00 Min. :0.00 Min. :0.00 Min. :0.00
1st Qu.:0.00 1st Qu.:0.00 1st Qu.:0.06 1st Qu.:0.04
Median :0.00 Median :0.00 Median :0.09 Median :0.08
Mean :0.01 Mean :0.03 Mean :0.12 Mean :0.09
3rd Qu.:0.00 3rd Qu.:0.02 3rd Qu.:0.15 3rd Qu.:0.11
Max. :1.00 Max. :1.00 Max. :1.00 Max. :1.00
NA's :213514 NA's :169682 NA's :156061 NA's :179943
YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI
Min. :0.00 Min. :0.00 Min. :0.00
1st Qu.:0.98 1st Qu.:0.69 1st Qu.:0.01
Median :0.98 Median :0.76 Median :0.02
Mean :0.98 Mean :0.76 Mean :0.04
3rd Qu.:0.99 3rd Qu.:0.83 3rd Qu.:0.05
Max. :1.00 Max. :1.00 Max. :1.00
NA's :150007 NA's :204488 NA's :214865
ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI
Min. :0.00 Min. :0.00 Min. :0.00 Min. :0.00
1st Qu.:0.00 1st Qu.:0.07 1st Qu.:0.17 1st Qu.:0.08
Median :0.00 Median :0.14 Median :0.17 Median :0.21
Mean :0.08 Mean :0.15 Mean :0.23 Mean :0.23
3rd Qu.:0.12 3rd Qu.:0.21 3rd Qu.:0.33 3rd Qu.:0.38
Max. :1.00 Max. :1.00 Max. :1.00 Max. :1.00
NA's :163891 NA's :154828 NA's :153020 NA's :208642
LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI
Min. :0.00 Min. :0.00 Min. :0.00
1st Qu.:0.02 1st Qu.:0.05 1st Qu.:0.05
Median :0.05 Median :0.08 Median :0.07
Mean :0.07 Mean :0.10 Mean :0.11
3rd Qu.:0.09 3rd Qu.:0.12 3rd Qu.:0.13
Max. :1.00 Max. :1.00 Max. :1.00
NA's :182590 NA's :210199 NA's :154350
NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI FONDKAPREMONT_MODE
Min. :0.00 Min. :0.00 Length:307511
1st Qu.:0.00 1st Qu.:0.00 Class :character
Median :0.00 Median :0.00 Mode :character
Mean :0.01 Mean :0.03
3rd Qu.:0.00 3rd Qu.:0.03
Max. :1.00 Max. :1.00
NA's :213514 NA's :169682
HOUSETYPE_MODE TOTALAREA_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE
Length:307511 Min. :0.00 Length:307511 Length:307511
Class :character 1st Qu.:0.04 Class :character Class :character
Mode :character Median :0.07 Mode :character Mode :character
Mean :0.10
3rd Qu.:0.13
Max. :1.00
NA's :148431
OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE
Min. : 0.000 Min. : 0.0000 Min. : 0.000
1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.: 0.000
Median : 0.000 Median : 0.0000 Median : 0.000
Mean : 1.422 Mean : 0.1434 Mean : 1.405
3rd Qu.: 2.000 3rd Qu.: 0.0000 3rd Qu.: 2.000
Max. :348.000 Max. :34.0000 Max. :344.000
NA's :1021 NA's :1021 NA's :1021
DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2
Min. : 0.0 Min. :-4292.0 Min. :0.00e+00
1st Qu.: 0.0 1st Qu.:-1570.0 1st Qu.:0.00e+00
Median : 0.0 Median : -757.0 Median :0.00e+00
Mean : 0.1 Mean : -962.9 Mean :4.23e-05
3rd Qu.: 0.0 3rd Qu.: -274.0 3rd Qu.:0.00e+00
Max. :24.0 Max. : 0.0 Max. :1.00e+00
NA's :1021 NA's :1
FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6
Min. :0.00 Min. :0.00e+00 Min. :0.00000 Min. :0.00000
1st Qu.:0.00 1st Qu.:0.00e+00 1st Qu.:0.00000 1st Qu.:0.00000
Median :1.00 Median :0.00e+00 Median :0.00000 Median :0.00000
Mean :0.71 Mean :8.13e-05 Mean :0.01511 Mean :0.08806
3rd Qu.:1.00 3rd Qu.:0.00e+00 3rd Qu.:0.00000 3rd Qu.:0.00000
Max. :1.00 Max. :1.00e+00 Max. :1.00000 Max. :1.00000
FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10
Min. :0.0000000 Min. :0.00000 Min. :0.000000 Min. :0.00e+00
1st Qu.:0.0000000 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.00e+00
Median :0.0000000 Median :0.00000 Median :0.000000 Median :0.00e+00
Mean :0.0001919 Mean :0.08138 Mean :0.003896 Mean :2.28e-05
3rd Qu.:0.0000000 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.00e+00
Max. :1.0000000 Max. :1.00000 Max. :1.000000 Max. :1.00e+00
FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14
Min. :0.000000 Min. :0.0e+00 Min. :0.000000 Min. :0.000000
1st Qu.:0.000000 1st Qu.:0.0e+00 1st Qu.:0.000000 1st Qu.:0.000000
Median :0.000000 Median :0.0e+00 Median :0.000000 Median :0.000000
Mean :0.003912 Mean :6.5e-06 Mean :0.003525 Mean :0.002936
3rd Qu.:0.000000 3rd Qu.:0.0e+00 3rd Qu.:0.000000 3rd Qu.:0.000000
Max. :1.000000 Max. :1.0e+00 Max. :1.000000 Max. :1.000000
FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18
Min. :0.00000 Min. :0.000000 Min. :0.0000000 Min. :0.00000
1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.0000000 1st Qu.:0.00000
Median :0.00000 Median :0.000000 Median :0.0000000 Median :0.00000
Mean :0.00121 Mean :0.009928 Mean :0.0002667 Mean :0.00813
3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.0000000 3rd Qu.:0.00000
Max. :1.00000 Max. :1.000000 Max. :1.0000000 Max. :1.00000
FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21
Min. :0.0000000 Min. :0.0000000 Min. :0.0000000
1st Qu.:0.0000000 1st Qu.:0.0000000 1st Qu.:0.0000000
Median :0.0000000 Median :0.0000000 Median :0.0000000
Mean :0.0005951 Mean :0.0005073 Mean :0.0003349
3rd Qu.:0.0000000 3rd Qu.:0.0000000 3rd Qu.:0.0000000
Max. :1.0000000 Max. :1.0000000 Max. :1.0000000
AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY
Min. :0.00 Min. :0.00
1st Qu.:0.00 1st Qu.:0.00
Median :0.00 Median :0.00
Mean :0.01 Mean :0.01
3rd Qu.:0.00 3rd Qu.:0.00
Max. :4.00 Max. :9.00
NA's :41519 NA's :41519
AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT
Min. :0.00 Min. : 0.00 Min. : 0.00
1st Qu.:0.00 1st Qu.: 0.00 1st Qu.: 0.00
Median :0.00 Median : 0.00 Median : 0.00
Mean :0.03 Mean : 0.27 Mean : 0.27
3rd Qu.:0.00 3rd Qu.: 0.00 3rd Qu.: 0.00
Max. :8.00 Max. :27.00 Max. :261.00
NA's :41519 NA's :41519 NA's :41519
AMT_REQ_CREDIT_BUREAU_YEAR
Min. : 0.0
1st Qu.: 0.0
Median : 1.0
Mean : 1.9
3rd Qu.: 3.0
Max. :25.0
NA's :41519
Credit Risk Prediction Model for Default Prevention
Foundations of Data Science with R (STAT 359)
Introduction
Project Overview
In this project, we aim to build a predictive model for credit risk with the goal of preventing default. The model will help financial institutions better assess the risk associated with credit applications. This report details the process from data loading and exploration to model building and evaluation.
Project Repository
You can find the code and data for this project in my GitHub repository: https://github.com/STAT359-2024SU/359-final-project-CaseyArbelaez.
Data Source
The dataset used for this analysis is sourced from Kaggle and contains various attributes related to credit applications. The data will be used to predict credit risk and improve default prevention strategies.
Necessary Libraries
Before diving into the data, we need to load the essential libraries required for our analysis. These libraries include packages for data manipulation, model training, and evaluation.
Data Loading
Loading the Dataset
We start by loading the dataset into a variable for further analysis. This step is crucial as it prepares the data for subsequent preprocessing and modeling tasks.
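As a minimal sketch of this step, the snippet below reads a CSV into `credit_data` with base R. The helper name `load_credit_data` and the small temporary file are assumptions for illustration only; the real Kaggle file name and path are not stated in this report.

```r
# Hypothetical loader: the helper name and the demo file are assumptions,
# not taken from the project repository.
load_credit_data <- function(path) {
  # stringsAsFactors = FALSE keeps text columns as character;
  # empty fields and the literal "NA" are both read as missing.
  read.csv(path, stringsAsFactors = FALSE, na.strings = c("", "NA"))
}

# Small stand-in file so the sketch runs without the Kaggle download.
demo_path <- tempfile(fileext = ".csv")
writeLines(c(
  "SK_ID_CURR,TARGET,CODE_GENDER,AMT_INCOME_TOTAL",
  "100002,1,M,202500",
  "100003,0,F,270000",
  "100004,0,M,"
), demo_path)

credit_data <- load_credit_data(demo_path)
str(credit_data)
```

With the real data, `load_credit_data()` would be pointed at the downloaded CSV instead of `demo_path`.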
Percentage of observations in the minority class: 8.07 %
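The figure above agrees with the mean of TARGET in the summary output (0.08073), because the mean of a 0/1 indicator is the share of positive cases. A minimal sketch of that calculation, using a toy vector in place of the real TARGET column:

```r
# Toy 0/1 target standing in for credit_data$TARGET (values are illustrative).
target <- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0)

# The mean of a 0/1 indicator equals the proportion of positive cases.
minority_pct <- 100 * mean(target == 1)

cat("Percentage of observations in the minority class:",
    round(minority_pct, 2), "%\n")
```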
Visualizing Class Imbalance on Original Dataset
To further understand the dataset, we visualize the distribution of the TARGET variable to illustrate any class imbalance. This plot will help us see the proportion of positive and negative cases in the original dataset.
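One way to draw such a plot is sketched below with base R graphics. The toy vector is an assumption that only roughly mimics the imbalance; the original plot may have been produced with a different plotting package.

```r
# Toy target with roughly the imbalance reported above (illustrative only).
target <- c(rep(0, 92), rep(1, 8))

# Count each class; fixing the factor levels keeps both bars
# even if one class happened to be absent.
class_counts <- table(factor(target, levels = c(0, 1)))
class_share <- round(100 * prop.table(class_counts), 2)

barplot(
  class_counts,
  names.arg = c("TARGET = 0", "TARGET = 1"),
  ylab = "Number of applications",
  main = "Class distribution of TARGET"
)
print(class_share)
```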
EDA
To start our EDA (Exploratory Data Analysis), let’s perform the following steps:
- Check for Missing Values
- Summary Statistics for Numerical Features
- Distribution Plots for Key Numerical Variables
- Categorical Variable Analysis
1. Check for Missing Values
Understanding the amount of missing data in each column helps us plan our data cleaning and preprocessing steps.
Variable Missing
COMMONAREA_AVG COMMONAREA_AVG 214865
COMMONAREA_MODE COMMONAREA_MODE 214865
COMMONAREA_MEDI COMMONAREA_MEDI 214865
NONLIVINGAPARTMENTS_AVG NONLIVINGAPARTMENTS_AVG 213514
NONLIVINGAPARTMENTS_MODE NONLIVINGAPARTMENTS_MODE 213514
NONLIVINGAPARTMENTS_MEDI NONLIVINGAPARTMENTS_MEDI 213514
FONDKAPREMONT_MODE FONDKAPREMONT_MODE 210295
LIVINGAPARTMENTS_AVG LIVINGAPARTMENTS_AVG 210199
LIVINGAPARTMENTS_MODE LIVINGAPARTMENTS_MODE 210199
LIVINGAPARTMENTS_MEDI LIVINGAPARTMENTS_MEDI 210199
FLOORSMIN_AVG FLOORSMIN_AVG 208642
FLOORSMIN_MODE FLOORSMIN_MODE 208642
FLOORSMIN_MEDI FLOORSMIN_MEDI 208642
YEARS_BUILD_AVG YEARS_BUILD_AVG 204488
YEARS_BUILD_MODE YEARS_BUILD_MODE 204488
YEARS_BUILD_MEDI YEARS_BUILD_MEDI 204488
OWN_CAR_AGE OWN_CAR_AGE 202929
LANDAREA_AVG LANDAREA_AVG 182590
LANDAREA_MODE LANDAREA_MODE 182590
LANDAREA_MEDI LANDAREA_MEDI 182590
BASEMENTAREA_AVG BASEMENTAREA_AVG 179943
BASEMENTAREA_MODE BASEMENTAREA_MODE 179943
BASEMENTAREA_MEDI BASEMENTAREA_MEDI 179943
EXT_SOURCE_1 EXT_SOURCE_1 173378
NONLIVINGAREA_AVG NONLIVINGAREA_AVG 169682
NONLIVINGAREA_MODE NONLIVINGAREA_MODE 169682
NONLIVINGAREA_MEDI NONLIVINGAREA_MEDI 169682
ELEVATORS_AVG ELEVATORS_AVG 163891
ELEVATORS_MODE ELEVATORS_MODE 163891
ELEVATORS_MEDI ELEVATORS_MEDI 163891
WALLSMATERIAL_MODE WALLSMATERIAL_MODE 156341
APARTMENTS_AVG APARTMENTS_AVG 156061
APARTMENTS_MODE APARTMENTS_MODE 156061
APARTMENTS_MEDI APARTMENTS_MEDI 156061
ENTRANCES_AVG ENTRANCES_AVG 154828
ENTRANCES_MODE ENTRANCES_MODE 154828
ENTRANCES_MEDI ENTRANCES_MEDI 154828
LIVINGAREA_AVG LIVINGAREA_AVG 154350
LIVINGAREA_MODE LIVINGAREA_MODE 154350
LIVINGAREA_MEDI LIVINGAREA_MEDI 154350
HOUSETYPE_MODE HOUSETYPE_MODE 154297
FLOORSMAX_AVG FLOORSMAX_AVG 153020
FLOORSMAX_MODE FLOORSMAX_MODE 153020
FLOORSMAX_MEDI FLOORSMAX_MEDI 153020
YEARS_BEGINEXPLUATATION_AVG YEARS_BEGINEXPLUATATION_AVG 150007
YEARS_BEGINEXPLUATATION_MODE YEARS_BEGINEXPLUATATION_MODE 150007
YEARS_BEGINEXPLUATATION_MEDI YEARS_BEGINEXPLUATATION_MEDI 150007
TOTALAREA_MODE TOTALAREA_MODE 148431
EMERGENCYSTATE_MODE EMERGENCYSTATE_MODE 145755
OCCUPATION_TYPE OCCUPATION_TYPE 96391
EXT_SOURCE_3 EXT_SOURCE_3 60965
AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_HOUR 41519
AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_DAY 41519
AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_WEEK 41519
AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_MON 41519
AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_QRT 41519
AMT_REQ_CREDIT_BUREAU_YEAR AMT_REQ_CREDIT_BUREAU_YEAR 41519
NAME_TYPE_SUITE NAME_TYPE_SUITE 1292
OBS_30_CNT_SOCIAL_CIRCLE OBS_30_CNT_SOCIAL_CIRCLE 1021
DEF_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE 1021
OBS_60_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE 1021
DEF_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE 1021
EXT_SOURCE_2 EXT_SOURCE_2 660
AMT_GOODS_PRICE AMT_GOODS_PRICE 278
AMT_ANNUITY AMT_ANNUITY 12
CNT_FAM_MEMBERS CNT_FAM_MEMBERS 2
DAYS_LAST_PHONE_CHANGE DAYS_LAST_PHONE_CHANGE 1
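A table like the one above can be built by counting NA values per column and sorting the result. The sketch below assumes a small toy data frame in place of `credit_data`; the `Pct` column is an addition for illustration and is not part of the original output.

```r
# Toy data frame standing in for credit_data (values are illustrative).
credit_data <- data.frame(
  AMT_ANNUITY = c(24700.5, NA, 6750, 29686.5),
  OWN_CAR_AGE = c(NA, NA, 26, NA),
  CODE_GENDER = c("M", "F", "M", "F"),
  stringsAsFactors = FALSE
)

# Count missing values per column.
missing_counts <- colSums(is.na(credit_data))

missing_table <- data.frame(
  Variable = names(missing_counts),
  Missing = as.integer(missing_counts),
  stringsAsFactors = FALSE
)

# Keep only columns with at least one missing value, largest count first.
missing_table <- missing_table[missing_table$Missing > 0, ]
missing_table <- missing_table[order(-missing_table$Missing), ]

# Share of rows that are missing, in percent (extra column for context).
missing_table$Pct <- round(100 * missing_table$Missing / nrow(credit_data), 1)

print(missing_table, row.names = FALSE)
```

Printing with `row.names = FALSE` avoids showing each variable name twice.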
2. Summary Statistics for Numerical Features
We’ll explore the summary statistics to get a sense of the range, central tendency, and spread of the numerical features.
SK_ID_CURR TARGET CNT_CHILDREN AMT_INCOME_TOTAL
Min. :100002 Min. :0.00000 Min. : 0.0000 Min. : 25650
1st Qu.:189146 1st Qu.:0.00000 1st Qu.: 0.0000 1st Qu.: 112500
Median :278202 Median :0.00000 Median : 0.0000 Median : 147150
Mean :278181 Mean :0.08073 Mean : 0.4171 Mean : 168798
3rd Qu.:367143 3rd Qu.:0.00000 3rd Qu.: 1.0000 3rd Qu.: 202500
Max. :456255 Max. :1.00000 Max. :19.0000 Max. :117000000
AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE
Min. : 45000 Min. : 1616 Min. : 40500
1st Qu.: 270000 1st Qu.: 16524 1st Qu.: 238500
Median : 513531 Median : 24903 Median : 450000
Mean : 599026 Mean : 27109 Mean : 538396
3rd Qu.: 808650 3rd Qu.: 34596 3rd Qu.: 679500
Max. :4050000 Max. :258026 Max. :4050000
NA's :12 NA's :278
REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION
Min. :0.00029 Min. :-25229 Min. :-17912 Min. :-24672
1st Qu.:0.01001 1st Qu.:-19682 1st Qu.: -2760 1st Qu.: -7480
Median :0.01885 Median :-15750 Median : -1213 Median : -4504
Mean :0.02087 Mean :-16037 Mean : 63815 Mean : -4986
3rd Qu.:0.02866 3rd Qu.:-12413 3rd Qu.: -289 3rd Qu.: -2010
Max. :0.07251 Max. : -7489 Max. :365243 Max. : 0
DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE
Min. :-7197 Min. : 0.00 Min. :0 Min. :0.0000
1st Qu.:-4299 1st Qu.: 5.00 1st Qu.:1 1st Qu.:1.0000
Median :-3254 Median : 9.00 Median :1 Median :1.0000
Mean :-2994 Mean :12.06 Mean :1 Mean :0.8199
3rd Qu.:-1720 3rd Qu.:15.00 3rd Qu.:1 3rd Qu.:1.0000
Max. : 0 Max. :91.00 Max. :1 Max. :1.0000
NA's :202929
FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:0.00000
Median :0.0000 Median :1.0000 Median :0.0000 Median :0.00000
Mean :0.1994 Mean :0.9981 Mean :0.2811 Mean :0.05672
3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.00000
Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000
CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY
Min. : 1.000 Min. :1.000 Min. :1.000
1st Qu.: 2.000 1st Qu.:2.000 1st Qu.:2.000
Median : 2.000 Median :2.000 Median :2.000
Mean : 2.153 Mean :2.052 Mean :2.032
3rd Qu.: 3.000 3rd Qu.:2.000 3rd Qu.:2.000
Max. :20.000 Max. :3.000 Max. :3.000
NA's :2
HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION
Min. : 0.00 Min. :0.00000 Min. :0.00000
1st Qu.:10.00 1st Qu.:0.00000 1st Qu.:0.00000
Median :12.00 Median :0.00000 Median :0.00000
Mean :12.06 Mean :0.01514 Mean :0.05077
3rd Qu.:14.00 3rd Qu.:0.00000 3rd Qu.:0.00000
Max. :23.00 Max. :1.00000 Max. :1.00000
LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY
Min. :0.00000 Min. :0.00000 Min. :0.0000
1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000
Median :0.00000 Median :0.00000 Median :0.0000
Mean :0.04066 Mean :0.07817 Mean :0.2305
3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.0000
Max. :1.00000 Max. :1.00000 Max. :1.0000
LIVE_CITY_NOT_WORK_CITY EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3
Min. :0.0000 Min. :0.01 Min. :0.0000 Min. :0.00
1st Qu.:0.0000 1st Qu.:0.33 1st Qu.:0.3925 1st Qu.:0.37
Median :0.0000 Median :0.51 Median :0.5660 Median :0.54
Mean :0.1796 Mean :0.50 Mean :0.5144 Mean :0.51
3rd Qu.:0.0000 3rd Qu.:0.68 3rd Qu.:0.6636 3rd Qu.:0.67
Max. :1.0000 Max. :0.96 Max. :0.8550 Max. :0.90
NA's :173378 NA's :660 NA's :60965
APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG
Min. :0.00 Min. :0.00 Min. :0.00 Min. :0.00
1st Qu.:0.06 1st Qu.:0.04 1st Qu.:0.98 1st Qu.:0.69
Median :0.09 Median :0.08 Median :0.98 Median :0.76
Mean :0.12 Mean :0.09 Mean :0.98 Mean :0.75
3rd Qu.:0.15 3rd Qu.:0.11 3rd Qu.:0.99 3rd Qu.:0.82
Max. :1.00 Max. :1.00 Max. :1.00 Max. :1.00
NA's :156061 NA's :179943 NA's :150007 NA's :204488
COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG
Min. :0.00 Min. :0.00 Min. :0.00 Min. :0.00
1st Qu.:0.01 1st Qu.:0.00 1st Qu.:0.07 1st Qu.:0.17
Median :0.02 Median :0.00 Median :0.14 Median :0.17
Mean :0.04 Mean :0.08 Mean :0.15 Mean :0.23
3rd Qu.:0.05 3rd Qu.:0.12 3rd Qu.:0.21 3rd Qu.:0.33
Max. :1.00 Max. :1.00 Max. :1.00 Max. :1.00
NA's :214865 NA's :163891 NA's :154828 NA's :153020
FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG
Min. :0.00 Min. :0.00 Min. :0.00 Min. :0.00
1st Qu.:0.08 1st Qu.:0.02 1st Qu.:0.05 1st Qu.:0.05
Median :0.21 Median :0.05 Median :0.08 Median :0.07
Mean :0.23 Mean :0.07 Mean :0.10 Mean :0.11
3rd Qu.:0.38 3rd Qu.:0.09 3rd Qu.:0.12 3rd Qu.:0.13
Max. :1.00 Max. :1.00 Max. :1.00 Max. :1.00
NA's :208642 NA's :182590 NA's :210199 NA's :154350
NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE
Min. :0.00 Min. :0.00 Min. :0.00 Min. :0.00
1st Qu.:0.00 1st Qu.:0.00 1st Qu.:0.05 1st Qu.:0.04
Median :0.00 Median :0.00 Median :0.08 Median :0.07
Mean :0.01 Mean :0.03 Mean :0.11 Mean :0.09
3rd Qu.:0.00 3rd Qu.:0.03 3rd Qu.:0.14 3rd Qu.:0.11
Max. :1.00 Max. :1.00 Max. :1.00 Max. :1.00
NA's :213514 NA's :169682 NA's :156061 NA's :179943
YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE
Min. :0.00 Min. :0.00 Min. :0.00
1st Qu.:0.98 1st Qu.:0.70 1st Qu.:0.01
Median :0.98 Median :0.76 Median :0.02
Mean :0.98 Mean :0.76 Mean :0.04
3rd Qu.:0.99 3rd Qu.:0.82 3rd Qu.:0.05
Max. :1.00 Max. :1.00 Max. :1.00
NA's :150007 NA's :204488 NA's :214865
ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE
Min. :0.00 Min. :0.00 Min. :0.00 Min. :0.00
1st Qu.:0.00 1st Qu.:0.07 1st Qu.:0.17 1st Qu.:0.08
Median :0.00 Median :0.14 Median :0.17 Median :0.21
Mean :0.07 Mean :0.15 Mean :0.22 Mean :0.23
3rd Qu.:0.12 3rd Qu.:0.21 3rd Qu.:0.33 3rd Qu.:0.38
Max. :1.00 Max. :1.00 Max. :1.00 Max. :1.00
NA's :163891 NA's :154828 NA's :153020 NA's :208642
LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE
Min. :0.00 Min. :0.00 Min. :0.00
1st Qu.:0.02 1st Qu.:0.05 1st Qu.:0.04
Median :0.05 Median :0.08 Median :0.07
Mean :0.06 Mean :0.11 Mean :0.11
3rd Qu.:0.08 3rd Qu.:0.13 3rd Qu.:0.13
Max. :1.00 Max. :1.00 Max. :1.00
NA's :182590 NA's :210199 NA's :154350
NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI
Min. :0.00 Min. :0.00 Min. :0.00 Min. :0.00
1st Qu.:0.00 1st Qu.:0.00 1st Qu.:0.06 1st Qu.:0.04
Median :0.00 Median :0.00 Median :0.09 Median :0.08
Mean :0.01 Mean :0.03 Mean :0.12 Mean :0.09
3rd Qu.:0.00 3rd Qu.:0.02 3rd Qu.:0.15 3rd Qu.:0.11
Max. :1.00 Max. :1.00 Max. :1.00 Max. :1.00
NA's :213514 NA's :169682 NA's :156061 NA's :179943
YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI
Min. :0.00 Min. :0.00 Min. :0.00
1st Qu.:0.98 1st Qu.:0.69 1st Qu.:0.01
Median :0.98 Median :0.76 Median :0.02
Mean :0.98 Mean :0.76 Mean :0.04
3rd Qu.:0.99 3rd Qu.:0.83 3rd Qu.:0.05
Max. :1.00 Max. :1.00 Max. :1.00
NA's :150007 NA's :204488 NA's :214865
ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI
Min. :0.00 Min. :0.00 Min. :0.00 Min. :0.00
1st Qu.:0.00 1st Qu.:0.07 1st Qu.:0.17 1st Qu.:0.08
Median :0.00 Median :0.14 Median :0.17 Median :0.21
Mean :0.08 Mean :0.15 Mean :0.23 Mean :0.23
3rd Qu.:0.12 3rd Qu.:0.21 3rd Qu.:0.33 3rd Qu.:0.38
Max. :1.00 Max. :1.00 Max. :1.00 Max. :1.00
NA's :163891 NA's :154828 NA's :153020 NA's :208642
LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI
Min. :0.00 Min. :0.00 Min. :0.00
1st Qu.:0.02 1st Qu.:0.05 1st Qu.:0.05
Median :0.05 Median :0.08 Median :0.07
Mean :0.07 Mean :0.10 Mean :0.11
3rd Qu.:0.09 3rd Qu.:0.12 3rd Qu.:0.13
Max. :1.00 Max. :1.00 Max. :1.00
NA's :182590 NA's :210199 NA's :154350
NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI TOTALAREA_MODE
Min. :0.00 Min. :0.00 Min. :0.00
1st Qu.:0.00 1st Qu.:0.00 1st Qu.:0.04
Median :0.00 Median :0.00 Median :0.07
Mean :0.01 Mean :0.03 Mean :0.10
3rd Qu.:0.00 3rd Qu.:0.03 3rd Qu.:0.13
Max. :1.00 Max. :1.00 Max. :1.00
NA's :213514 NA's :169682 NA's :148431
OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE
Min. : 0.000 Min. : 0.0000 Min. : 0.000
1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.: 0.000
Median : 0.000 Median : 0.0000 Median : 0.000
Mean : 1.422 Mean : 0.1434 Mean : 1.405
3rd Qu.: 2.000 3rd Qu.: 0.0000 3rd Qu.: 2.000
Max. :348.000 Max. :34.0000 Max. :344.000
NA's :1021 NA's :1021 NA's :1021
DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2
Min. : 0.0 Min. :-4292.0 Min. :0.00e+00
1st Qu.: 0.0 1st Qu.:-1570.0 1st Qu.:0.00e+00
Median : 0.0 Median : -757.0 Median :0.00e+00
Mean : 0.1 Mean : -962.9 Mean :4.23e-05
3rd Qu.: 0.0 3rd Qu.: -274.0 3rd Qu.:0.00e+00
Max. :24.0 Max. : 0.0 Max. :1.00e+00
NA's :1021 NA's :1
FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6
Min. :0.00 Min. :0.00e+00 Min. :0.00000 Min. :0.00000
1st Qu.:0.00 1st Qu.:0.00e+00 1st Qu.:0.00000 1st Qu.:0.00000
Median :1.00 Median :0.00e+00 Median :0.00000 Median :0.00000
Mean :0.71 Mean :8.13e-05 Mean :0.01511 Mean :0.08806
3rd Qu.:1.00 3rd Qu.:0.00e+00 3rd Qu.:0.00000 3rd Qu.:0.00000
Max. :1.00 Max. :1.00e+00 Max. :1.00000 Max. :1.00000
FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10
Min. :0.0000000 Min. :0.00000 Min. :0.000000 Min. :0.00e+00
1st Qu.:0.0000000 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.00e+00
Median :0.0000000 Median :0.00000 Median :0.000000 Median :0.00e+00
Mean :0.0001919 Mean :0.08138 Mean :0.003896 Mean :2.28e-05
3rd Qu.:0.0000000 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.00e+00
Max. :1.0000000 Max. :1.00000 Max. :1.000000 Max. :1.00e+00
FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14
Min. :0.000000 Min. :0.0e+00 Min. :0.000000 Min. :0.000000
1st Qu.:0.000000 1st Qu.:0.0e+00 1st Qu.:0.000000 1st Qu.:0.000000
Median :0.000000 Median :0.0e+00 Median :0.000000 Median :0.000000
Mean :0.003912 Mean :6.5e-06 Mean :0.003525 Mean :0.002936
3rd Qu.:0.000000 3rd Qu.:0.0e+00 3rd Qu.:0.000000 3rd Qu.:0.000000
Max. :1.000000 Max. :1.0e+00 Max. :1.000000 Max. :1.000000
FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18
Min. :0.00000 Min. :0.000000 Min. :0.0000000 Min. :0.00000
1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.0000000 1st Qu.:0.00000
Median :0.00000 Median :0.000000 Median :0.0000000 Median :0.00000
Mean :0.00121 Mean :0.009928 Mean :0.0002667 Mean :0.00813
3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.0000000 3rd Qu.:0.00000
Max. :1.00000 Max. :1.000000 Max. :1.0000000 Max. :1.00000
FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21
Min. :0.0000000 Min. :0.0000000 Min. :0.0000000
1st Qu.:0.0000000 1st Qu.:0.0000000 1st Qu.:0.0000000
Median :0.0000000 Median :0.0000000 Median :0.0000000
Mean :0.0005951 Mean :0.0005073 Mean :0.0003349
3rd Qu.:0.0000000 3rd Qu.:0.0000000 3rd Qu.:0.0000000
Max. :1.0000000 Max. :1.0000000 Max. :1.0000000
AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY
Min. :0.00 Min. :0.00
1st Qu.:0.00 1st Qu.:0.00
Median :0.00 Median :0.00
Mean :0.01 Mean :0.01
3rd Qu.:0.00 3rd Qu.:0.00
Max. :4.00 Max. :9.00
NA's :41519 NA's :41519
AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT
Min. :0.00 Min. : 0.00 Min. : 0.00
1st Qu.:0.00 1st Qu.: 0.00 1st Qu.: 0.00
Median :0.00 Median : 0.00 Median : 0.00
Mean :0.03 Mean : 0.27 Mean : 0.27
3rd Qu.:0.00 3rd Qu.: 0.00 3rd Qu.: 0.00
Max. :8.00 Max. :27.00 Max. :261.00
NA's :41519 NA's :41519 NA's :41519
AMT_REQ_CREDIT_BUREAU_YEAR
Min. : 0.0
1st Qu.: 0.0
Median : 1.0
Mean : 1.9
3rd Qu.: 3.0
Max. :25.0
NA's :41519
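A summary restricted to numeric columns, like the output above, could be produced along the following lines. The toy data frame is an assumption. The sketch also counts positive values of DAYS_EMPLOYED, because the summary shows a maximum of 365243 while all three quartiles are negative, which suggests that the maximum may be a placeholder rather than a real duration.

```r
# Toy data frame standing in for credit_data (values are illustrative).
credit_data <- data.frame(
  AMT_CREDIT = c(406597.5, 1293502.5, 135000, 312682.5),
  DAYS_EMPLOYED = c(-637, -1188, 365243, -3039),
  NAME_CONTRACT_TYPE = c("Cash loans", "Cash loans",
                         "Revolving loans", "Cash loans"),
  stringsAsFactors = FALSE
)

# Identify numeric columns and summarise only those.
numeric_cols <- vapply(credit_data, is.numeric, logical(1))
numeric_summary <- summary(credit_data[, numeric_cols, drop = FALSE])
print(numeric_summary)

# Count suspicious positive values in a column that is otherwise negative.
n_positive_days <- sum(credit_data$DAYS_EMPLOYED > 0, na.rm = TRUE)
cat("Rows with positive DAYS_EMPLOYED:", n_positive_days, "\n")
```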
3. Distribution Plots for Key Numerical Variables
Visualizing the distribution of key numerical variables can help us detect skewness, outliers, and the need for transformations.
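The sketch below illustrates the idea on simulated, right-skewed income values, since the real AMT_INCOME_TOTAL column is not available here. The simulated distribution and the `skewness` helper are assumptions for illustration. Comparing the raw and log-transformed histograms shows how a transformation can reduce skew.

```r
set.seed(359)

# Simulated right-skewed incomes standing in for AMT_INCOME_TOTAL.
income <- rlnorm(1000, meanlog = log(150000), sdlog = 0.5)

# Simple moment-based sample skewness (hypothetical helper).
skewness <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}

# Side-by-side histograms: raw scale versus log10 scale.
op <- par(mfrow = c(1, 2))
hist(income, breaks = 40,
     main = "AMT_INCOME_TOTAL (raw)", xlab = "Income")
hist(log10(income), breaks = 40,
     main = "AMT_INCOME_TOTAL (log10)", xlab = "log10(Income)")
par(op)

skew_raw <- skewness(income)
skew_log <- skewness(log10(income))
print(c(raw = skew_raw, log10 = skew_log))
```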
4. Categorical Variable Analysis
Visualizing the distribution of categorical variables helps us understand the frequency of different categories.
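As a small sketch, the frequency of each category and the share of TARGET = 1 within each category can be computed with `table()` and `tapply()`. The toy values below are assumptions and do not reflect the real data.

```r
# Toy data standing in for credit_data (values are illustrative).
credit_data <- data.frame(
  NAME_CONTRACT_TYPE = c("Cash loans", "Cash loans", "Revolving loans",
                         "Cash loans", "Revolving loans", "Cash loans"),
  TARGET = c(0, 1, 0, 0, 0, 1),
  stringsAsFactors = FALSE
)

# Frequency of each category, most common first.
freq <- sort(table(credit_data$NAME_CONTRACT_TYPE), decreasing = TRUE)

# Share of TARGET = 1 within each category.
default_rate <- tapply(credit_data$TARGET,
                       credit_data$NAME_CONTRACT_TYPE, mean)

barplot(freq, ylab = "Number of applications",
        main = "NAME_CONTRACT_TYPE")
print(freq)
print(default_rate)
```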
PCA Analysis
Let’s perform PCA (Principal Component Analysis) to identify the variance captured by the principal components and visualize it using scree plots and cumulative explained variance. This will help us assess whether we can reduce dimensionality by leveraging strong linear relationships in the data.
Steps:
- Data Preparation: We will preprocess the data, focusing on scaling numeric columns.
- PCA Computation: Perform PCA on the standardized numeric data.
- Scree Plot: Plot the explained variance of each principal component.
- Cumulative Explained Variance Plot: Visualize the cumulative variance explained to determine the number of components needed to capture most of the variance.
1. Data Preparation
First, let’s preprocess the data by selecting numeric columns and standardizing them.
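A minimal sketch of this preparation follows, using simulated data. Median imputation and the removal of constant columns are assumptions added so that `scale()` returns finite values; the report does not state how missing values were handled before PCA.

```r
set.seed(359)
n <- 200

# Simulated stand-in for credit_data (values are illustrative).
credit_data <- data.frame(
  AMT_CREDIT = rlnorm(n, log(500000), 0.6),
  AMT_ANNUITY = rlnorm(n, log(25000), 0.5),
  DAYS_BIRTH = -sample(7500:25000, n, replace = TRUE),
  FLAG_MOBIL = rep(1, n),  # constant in this toy data
  CODE_GENDER = sample(c("M", "F"), n, replace = TRUE),
  stringsAsFactors = FALSE
)
credit_data$AMT_ANNUITY[c(3, 17)] <- NA

# Keep numeric columns only.
is_num <- vapply(credit_data, is.numeric, logical(1))
numeric_data <- credit_data[, is_num, drop = FALSE]

# Assumption: replace missing values with the column median.
numeric_data[] <- lapply(numeric_data, function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
})

# Drop constant columns, which cannot be standardized.
keep <- vapply(numeric_data, function(x) sd(x) > 0, logical(1))

# Standardize to mean 0 and standard deviation 1.
scaled_data <- scale(numeric_data[, keep, drop = FALSE])
print(dim(scaled_data))
```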
2. PCA Computation
Now, let’s perform PCA on the scaled numeric data.
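With base R this step can be sketched using `prcomp()`. The simulated matrix below is an assumption standing in for the standardized credit data; `explained_variance` holds the share of variance per component.

```r
set.seed(359)

# Simulated standardized data: x2 is strongly correlated with x1.
x1 <- rnorm(300)
x2 <- x1 + rnorm(300, sd = 0.3)
x3 <- rnorm(300)
x4 <- rnorm(300)
scaled_data <- scale(cbind(x1, x2, x3, x4))

# The data are already centered and scaled, so prcomp need not repeat it.
pca_result <- prcomp(scaled_data, center = FALSE, scale. = FALSE)

# Variance explained by each component = squared standard deviation
# divided by the total variance.
explained_variance <- pca_result$sdev^2 / sum(pca_result$sdev^2)
print(round(explained_variance, 3))
```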
3. Scree Plot
We’ll plot the variance explained by each principal component.
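A scree plot can be drawn directly from the component standard deviations. The sketch below uses simulated data, so the shape of the curve is illustrative only and does not reproduce the report's plot.

```r
set.seed(359)

# Simulated standardized data with six columns (illustrative only).
base <- rnorm(300)
toy <- cbind(
  a = base + rnorm(300, sd = 0.5),
  b = base + rnorm(300, sd = 0.5),
  c = rnorm(300), d = rnorm(300), e = rnorm(300), f = rnorm(300)
)
pca_result <- prcomp(toy, center = TRUE, scale. = TRUE)
explained_variance <- pca_result$sdev^2 / sum(pca_result$sdev^2)

# Scree plot: proportion of variance per principal component.
plot(
  seq_along(explained_variance), explained_variance,
  type = "b", pch = 19,
  xlab = "Principal component",
  ylab = "Proportion of variance explained",
  main = "Scree plot"
)
```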
4. Cumulative Explained Variance Plot
Next, we’ll visualize the cumulative explained variance to determine how many components explain a significant portion of the variance.
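The cumulative curve is the running sum of the per-component proportions, and the number of components needed for a given threshold is the first index at which the curve reaches it. A sketch on simulated data, where the 90% threshold is the example value used in this report:

```r
set.seed(359)

# Simulated data with six columns (illustrative only).
base <- rnorm(300)
toy <- cbind(
  a = base + rnorm(300, sd = 0.5),
  b = base + rnorm(300, sd = 0.5),
  c = rnorm(300), d = rnorm(300), e = rnorm(300), f = rnorm(300)
)
pca_result <- prcomp(toy, center = TRUE, scale. = TRUE)
explained_variance <- pca_result$sdev^2 / sum(pca_result$sdev^2)

# Running total of explained variance.
cumulative_variance <- cumsum(explained_variance)

# First component count at which the threshold is reached.
threshold <- 0.90
n_components <- which(cumulative_variance >= threshold)[1]

plot(
  seq_along(cumulative_variance), cumulative_variance,
  type = "b", pch = 19, ylim = c(0, 1),
  xlab = "Number of principal components",
  ylab = "Cumulative proportion of variance explained",
  main = "Cumulative explained variance"
)
abline(h = threshold, lty = 2)

cat("Components needed for", 100 * threshold, "% of variance:",
    n_components, "\n")
```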
Interpretation
Scree Plot: This plot will show how much variance each principal component explains. Look for the “elbow” point where adding more components yields diminishing returns in explained variance.
Cumulative Explained Variance Plot: This will help identify the number of principal components that capture a desired threshold (e.g., 90%) of the total variance.
Based on these plots, the columns in our data do not show strong linear relationships with one another: in the scree plot, the first principal component explains only about 20% of the variance, and we would need about 50 principal components to capture 90% of it. PCA is therefore not a beneficial transformation here. The relationships in the data appear to be more complex than a linear projection can capture, and because of the class imbalance it is crucial that we preserve as much of the original information as possible.
UMAP (Uniform Manifold Approximation and Projection) Analysis
I attempted to use UMAP (Uniform Manifold Approximation and Projection) to explore whether it could cluster our credit_data more effectively than PCA, potentially capturing more complex relationships in a low-dimensional projection. UMAP is particularly useful when data has non-linear relationships that PCA might not capture due to its linear nature. By applying UMAP, I aimed to visualize the data in 2D and 3D spaces to check for any natural clusters that might emerge, especially given the imbalanced nature of our target variable.
Steps
- Data Preparation:
- I performed one-hot encoding on the categorical variables and dropped columns with over 100,000 missing values. I also removed rows with missing data.
- To deal with class imbalance, I downsampled the data to create a balanced subset for the analysis.
- UMAP Implementation:
- I performed UMAP dimensionality reduction with specified parameters such as n_neighbors and min_dist to project the data into two and three dimensions.
- I plotted the UMAP results using both 2D and 3D visualizations to observe how well the data clustered according to the target variable.
- Hyperparameter Tuning:
- I explored different hyperparameter combinations for UMAP by looping through various values of
n_neighbors,min_dist, andspread. - For each combination, I generated and saved plots to visualize how changes in these parameters affected the clustering.
- I explored different hyperparameter combinations for UMAP by looping through various values of
- Comparison with PCA:
- By using UMAP, I sought to explore potential non-linear relationships and more complex clustering structures that PCA might miss, especially in cases where linear assumptions do not hold strongly.
Here are some of the plots that were generated by this transformation:
Note: For more plots with different hyperparameters, see the plots folder in the repo.
UMAP is a powerful tool for visualizing high-dimensional data, and it complements PCA well when the data exhibits non-linear relationships that linear techniques cannot fully capture. Although neither transformation revealed clear underlying clusters in our data, both were worthwhile as part of the exploration in EDA. For future work, we could experiment with a PCA -> UMAP -> KNN or UMAP -> KNN pipeline: while our UMAP projections did not produce global clusters, some local clusters could be observed. A KNN classifier with a small k might pick up on these local patterns.
Feature Selection
Based on our inspection, we select specific categorical and numerical columns for our logistic regression, KNN, and Random Forest models. The chosen features are described below.
Categorical Features:
- CODE_GENDER: Gender of the applicant (e.g., “M” for Male, “F” for Female).
- NAME_CONTRACT_TYPE: Type of loan contract (e.g., “Cash loans,” “Revolving loans”).
- FLAG_OWN_CAR: Indicates car ownership (“Y” for Yes, “N” for No).
- FLAG_OWN_REALTY: Indicates real estate ownership (“Y” for Yes, “N” for No).
- NAME_INCOME_TYPE: Type of income of the applicant (e.g., “Working,” “Commercial associate,” “Pensioner”).
- NAME_EDUCATION_TYPE: Educational background of the applicant (e.g., “Higher education,” “Secondary education,” “Incomplete higher”).
- NAME_FAMILY_STATUS: Family status of the applicant (e.g., “Married,” “Single / not married,” “Divorced”).
- NAME_HOUSING_TYPE: Housing situation of the applicant (e.g., “House / apartment,” “With parents,” “Municipal apartment”).
- WEEKDAY_APPR_PROCESS_START: Day of the week when the application process started (e.g., “Monday,” “Tuesday”).
- REG_REGION_NOT_LIVE_REGION: Flag indicating whether the applicant’s permanent address region differs from their contact address region (1 = different, 0 = same). It is stored as a 0/1 number in the data, so it is treated as numeric during preprocessing.
Numerical Features:
- AMT_ANNUITY: Loan annuity amount.
- AMT_CREDIT: Total credit amount of the loan.
- CNT_CHILDREN: Number of children the applicant has.
- AMT_INCOME_TOTAL: Total income of the applicant.
- AMT_GOODS_PRICE: Price of the goods the loan is taken for.
- DAYS_EMPLOYED: Number of days before the application that the applicant started their current employment (negative values count back from the application date).
- DAYS_REGISTRATION: Number of days before the application that the applicant last changed their registration (negative values count back from the application date).
- DAYS_BIRTH: Age of the applicant in days at the time of application, recorded as a negative number.
- AMT_REQ_CREDIT_BUREAU_HOUR: Number of credit bureau requests in the past hour.
- AMT_REQ_CREDIT_BUREAU_DAY: Number of credit bureau requests in the past day.
- AMT_REQ_CREDIT_BUREAU_WEEK: Number of credit bureau requests in the past week.
- AMT_REQ_CREDIT_BUREAU_MON: Number of credit bureau requests in the past month.
- AMT_REQ_CREDIT_BUREAU_QRT: Number of credit bureau requests in the past quarter.
- AMT_REQ_CREDIT_BUREAU_YEAR: Number of credit bureau requests in the past year.
- OBS_30_CNT_SOCIAL_CIRCLE: Number of observations of the applicant’s social surroundings with an observable 30 days past due (DPD) default.
- DEF_30_CNT_SOCIAL_CIRCLE: Number of observations of the applicant’s social surroundings that defaulted at 30 DPD.
- OBS_60_CNT_SOCIAL_CIRCLE: Number of observations of the applicant’s social surroundings with an observable 60 DPD default.
- DEF_60_CNT_SOCIAL_CIRCLE: Number of observations of the applicant’s social surroundings that defaulted at 60 DPD.
- DAYS_LAST_PHONE_CHANGE: Number of days before the application that the applicant last changed their phone.
These features are chosen based on their potential relevance to the credit risk prediction as I believe these would be key indicators to analyze before issuing a loan to somebody.
Selecting the relevant columns from the dataset to focus on key variables for analysis.
TARGET CODE_GENDER
0 0
NAME_CONTRACT_TYPE FLAG_OWN_CAR
0 0
FLAG_OWN_REALTY NAME_INCOME_TYPE
0 0
NAME_EDUCATION_TYPE NAME_FAMILY_STATUS
0 0
NAME_HOUSING_TYPE WEEKDAY_APPR_PROCESS_START
0 0
REG_REGION_NOT_LIVE_REGION AMT_ANNUITY
0 12
AMT_CREDIT CNT_CHILDREN
0 0
AMT_INCOME_TOTAL AMT_GOODS_PRICE
0 278
DAYS_EMPLOYED DAYS_REGISTRATION
0 0
DAYS_BIRTH AMT_REQ_CREDIT_BUREAU_HOUR
0 41519
AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK
41519 41519
AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT
41519 41519
AMT_REQ_CREDIT_BUREAU_YEAR OBS_30_CNT_SOCIAL_CIRCLE
41519 1021
DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE
1021 1021
DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE
1021 1
[1] 264898 30
Assessing the amount of missing data in the selected columns and removing rows with any missing values to prepare the data for further analysis.
TARGET CODE_GENDER
0 4
NAME_CONTRACT_TYPE FLAG_OWN_CAR
0 0
FLAG_OWN_REALTY NAME_INCOME_TYPE
0 0
NAME_EDUCATION_TYPE NAME_FAMILY_STATUS
0 0
NAME_HOUSING_TYPE WEEKDAY_APPR_PROCESS_START
0 0
REG_REGION_NOT_LIVE_REGION AMT_ANNUITY
0 0
AMT_CREDIT CNT_CHILDREN
0 0
AMT_INCOME_TOTAL AMT_GOODS_PRICE
0 0
DAYS_EMPLOYED DAYS_REGISTRATION
0 0
DAYS_BIRTH AMT_REQ_CREDIT_BUREAU_HOUR
0 0
AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK
0 0
AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT
0 0
AMT_REQ_CREDIT_BUREAU_YEAR OBS_30_CNT_SOCIAL_CIRCLE
0 0
DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE
0 0
DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE
0 0
Filtering the dataset to remove rows with specific unwanted values and examining the target variable’s distribution to understand the class imbalance.
0 1
244403 20489
Percentage of observations in the minority class: 7.73 %
Visualizing Class Imbalance on Cleaned Dataset
To further understand the dataset, we visualize the distribution of the TARGET variable to illustrate any class imbalance. This plot shows the proportion of positive and negative cases in the cleaned dataset. Note that dropping the observations with NA values made the imbalance slightly worse: the minority-class share fell by 0.34 percentage points, from 8.07% in the original dataset to 7.73%, which is a meaningful loss given how small the minority class already was. For future development, one alternative to dropping observations with NA values is to replace them with the median (for numeric features) or the mode (for categorical features) of that feature column.
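The median/mode alternative mentioned above could look roughly like the sketch below. The helper names `mode_value` and `impute_simple` are hypothetical, and the toy rows only borrow two column names from the dataset; the values are made up.

```r
# Most frequent non-missing value of a vector.
mode_value <- function(x) {
  counts <- table(x)  # table() ignores NA by default
  names(counts)[which.max(counts)]
}

# Replace NAs column by column: median for numeric, mode for everything else.
impute_simple <- function(df) {
  for (col in names(df)) {
    missing <- is.na(df[[col]])
    if (!any(missing)) next
    if (is.numeric(df[[col]])) {
      df[[col]][missing] <- median(df[[col]], na.rm = TRUE)
    } else {
      df[[col]][missing] <- mode_value(df[[col]])
    }
  }
  df
}

# Toy rows; values are illustrative only.
toy <- data.frame(
  AMT_ANNUITY = c(16524, NA, 24903, 34596),
  CODE_GENDER = c("F", "F", NA, "M"),
  stringsAsFactors = FALSE
)
imputed <- impute_simple(toy)
print(imputed)
```

If this route is taken, the medians and modes would ideally be computed on the training split only and then reused on the test split, so that no information leaks from the test data.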
Data Sampling
Stratified Random Sampling
To improve training efficiency and keep the dataset manageable, we downsample it to 30,000 observations. We first calculate the proportion of each class in the original dataset and then determine how many samples to draw from each class so that these proportions are maintained in the downsampled dataset. Setting a seed makes the sampling process reproducible. The sampling function extracts the required number of samples for each class, and the results are combined into a single dataset. Note that this dataset is not balanced: it deliberately keeps the original class distribution. It is then saved for use in subsequent model training.
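The proportion-preserving draw described above can be sketched as follows. This is an illustration under stated assumptions, not the report's actual sampling function: the `population` data frame is simulated, `n_total` is set to 1,000 instead of 30,000 to keep the example small, and `sample_class` is a hypothetical helper name.

```r
set.seed(123)  # makes the sample reproducible

# Made-up population with roughly 8% positives, mimicking the imbalance.
population <- data.frame(
  id = 1:10000,
  TARGET = rbinom(10000, size = 1, prob = 0.08)
)

n_total <- 1000  # hypothetical target size

# Class proportions in the population and the matching per-class sample sizes.
class_prop <- prop.table(table(population$TARGET))
n_per_class <- setNames(as.integer(round(class_prop * n_total)),
                        names(class_prop))

# Draw the required number of rows from one class, without replacement.
sample_class <- function(level) {
  rows <- which(population$TARGET == as.integer(level))
  population[sample(rows, n_per_class[[level]]), ]
}
downsampled <- do.call(rbind, lapply(names(n_per_class), sample_class))

print(prop.table(table(population$TARGET)))
print(prop.table(table(downsampled$TARGET)))
```

The two printed tables should agree to within rounding, which is the property the next section checks visually.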
Visualizing Class Imbalance on Downsampled Dataset
To further understand the downsampled dataset, we visualize the distribution of the TARGET variable to illustrate any class imbalance and to show that it resembles that of the original dataset. This plot shows the proportion of positive and negative cases in the downsampled dataset.
Percentage of observations in the minority class: 8.07 %
As we can see, the target variable distribution of the downsampled data closely mimics that of the original data, because we sampled according to the original population's target distribution. However, this does not address the large gap between the classes. This naturally leads us to consider upsampling the minority class to prevent the model from being overly influenced by the majority class.
Visualizing Feature Distributions
To better understand the distribution of our feature variables, we plot histograms. This helps us identify the distribution patterns and the need for any transformations.
We define three functions for plotting histograms. The plot_histogram function plots the distribution of a variable without transformation. The plot_histogram_logs function applies a log transformation to better visualize variables with skewed distributions, making them easier to interpret. The plot_histogram_power function applies a power transformation (such as a square root) to visualize variables that benefit from reduced skewness. These plots help us understand the distribution and skewness of each feature, which informs our decisions on data transformations and preprocessing steps for the modeling phase.
Visualizing Numeric Variable Distributions
To gain insights into the distribution of numeric variables, we generate histograms for each variable. This helps us understand their distributions and decide if any transformations are needed.
Histograms of Numeric Variables
First, we plot histograms for all numeric variables without applying any transformations.
These histograms provide a visual representation of the distribution of each numeric variable. They reveal the general shape of the data, including any skewness or extreme values.
Next, we apply a log transformation to these variables and plot the histograms again.
These histograms provide a visual representation of the distribution of each numeric variable with a log transformation. With this transformation, we see that some numerical features benefit greatly from this, which we will cover later on.
Next, we apply a square root transformation to these variables and plot the histograms again.
Benefits of Log and Square Root Transformations
Log Transformation
The histograms with log transformation offer a clearer view of the data distribution, especially for variables with skewed distributions or extreme values. Variables like AMT_CREDIT and AMT_INCOME_TOTAL exhibit a right-skewed distribution that a log transformation brings closer to symmetric. This makes the data more suitable for modeling: models that rely on linear effects or distances, such as logistic regression and KNN, tend to behave better when features are not dominated by a few extreme values.
Variables Transformed Using Log:
- AMT_ANNUITY: Reduces the impact of high values and normalizes the distribution.
- AMT_CREDIT: Addresses right skewness and helps in stabilizing variance.
- AMT_INCOME_TOTAL: Reduces extreme values and helps in normalizing income distribution.
- AMT_GOODS_PRICE: Normalizes high-value outliers and improves data distribution.
- DAYS_EMPLOYED: Helps to reduce the impact of extreme values related to employment duration.
Square Root Transformation
Square root transformation is effective for variables with a distribution that exhibits moderate skewness, particularly when the values are non-negative and have a range of scales. It reduces the impact of large values and stabilizes variance, making the data more manageable for modeling.
Variables Transformed Using Square Root:
- DAYS_REGISTRATION: Reduces skewness and normalizes the distribution of registration days.
- DAYS_BIRTH: Addresses skewness and makes age-related data more suitable for modeling.
Note: CNT_CHILDREN, along with several other variables, does not benefit significantly from a log transformation due to its discrete nature and relatively narrow range of values. These variables are therefore left unchanged in our transformation process.
By applying these transformations, we aim to improve the distribution of our features, making them more appropriate for machine learning models and improving overall model performance.
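One way to put a number on the improvement seen in the histograms is to compare skewness before and after each transformation. The sketch below uses simulated values, not the credit data, and the `skewness` helper is written here for illustration. Because the DAYS_* columns are stored as negative numbers, the example takes absolute values before the square root; that handling is an assumption, since the report does not show how the sign is treated.

```r
# Sample skewness (third standardized moment); 0 means symmetric.
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

set.seed(1)
# Made-up right-skewed "income" and a negative day count like DAYS_REGISTRATION.
income <- rlnorm(5000, meanlog = 12, sdlog = 0.5)
days_registration <- -rexp(5000, rate = 1 / 5000)

# log() and sqrt() need non-negative input, so the negative day counts are
# flipped with abs() first (an assumption about the handling).
skew_table <- c(
  income_raw = skewness(income),
  income_log = skewness(log(income)),
  days_raw   = skewness(abs(days_registration)),
  days_sqrt  = skewness(sqrt(abs(days_registration)))
)
print(round(skew_table, 2))
```

A transformed value closer to zero than its raw counterpart indicates that the transformation reduced skewness for that variable.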
Data Preparation and Balancing
Splitting the Data
To build and evaluate our predictive model, we first split the downsampled dataset into training and testing sets. We use an 80-20 split, ensuring that both sets maintain a representative distribution of the target variable.
Handling Class Imbalance
In our dataset, the TARGET variable is extremely imbalanced, meaning that the number of non-default cases far exceeds the number of default cases. This imbalance can lead to biased models that favor the majority class. To address this, we use the ROSE (Random Over-Sampling Examples) library to upsample the minority class, increasing its representation in the training data so that the model can learn about the minority class.
0 1
22061 7939
Percentage of observations in the minority class: 26.46 %
By generating more examples of the minority class (default cases), we ensure that the model learns about both classes more effectively. This helps prevent the majority class from overwhelming the model’s learning process and improves the model’s ability to detect defaults.
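The idea behind upsampling can be sketched in base R as random oversampling with replacement. This is a stand-in for the concept, not the ROSE library's own function call; the `oversample_minority` helper, the simulated `train` data, and the 25% target share are all assumptions made for illustration.

```r
set.seed(2024)

# Made-up training set with roughly 8% positives.
train <- data.frame(
  x = rnorm(2000),
  TARGET = factor(rbinom(2000, size = 1, prob = 0.08), levels = c(0, 1))
)

# Random oversampling: resample minority rows with replacement until the
# minority class makes up at least `target_share` of the returned data.
oversample_minority <- function(df, target_col, minority, target_share) {
  is_min <- df[[target_col]] == minority
  n_major <- sum(!is_min)
  # Solve n_min / (n_min + n_major) = target_share for n_min.
  n_min_needed <- ceiling(target_share * n_major / (1 - target_share))
  extra <- n_min_needed - sum(is_min)
  if (extra <= 0) return(df)
  extra_rows <- sample(which(is_min), extra, replace = TRUE)
  rbind(df, df[extra_rows, , drop = FALSE])
}

train_up <- oversample_minority(train, "TARGET", minority = "1",
                                target_share = 0.25)
print(table(train_up$TARGET))
print(round(mean(train_up$TARGET == "1"), 4))
```

Only the training split is resampled here; the test split should keep its natural class distribution so that evaluation reflects real-world conditions.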
Preprocessing Recipes and Model Workflows
In this section, we define and apply preprocessing recipes for our models. These recipes ensure that the data is properly prepared before training. We will cover logistic regression, k-Nearest Neighbors (KNN), and Random Forest (RF) models. We are going to use ROC AUC as the evaluation metric.
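For reference, ROC AUC can be computed from scratch with the rank (Mann-Whitney) formulation. The sketch below is illustrative only: the `roc_auc` helper is written here by hand, and the labels and scores are made up rather than taken from any of the fitted models.

```r
# ROC AUC via ranks: the probability that a randomly chosen positive gets a
# higher score than a randomly chosen negative.
roc_auc <- function(truth, score, positive = "1") {
  is_pos <- truth == positive
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  ranks <- rank(score)  # ties receive average ranks
  (sum(ranks[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up labels and predicted default probabilities.
truth <- c("0", "0", "0", "0", "1", "0", "1", "0")
score <- c(0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.80, 0.15)

print(roc_auc(truth, score))
# A model that gives every applicant the same score carries no ranking signal.
print(roc_auc(truth, rep(0.5, length(truth))))
```

Because the metric depends only on how the scores rank positives against negatives, it does not reward a model for simply predicting the majority class, which makes it a reasonable fit for an imbalanced target like this one.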
Logistic Regression
Recipe Definition
For the logistic regression model, we create a recipe that includes:
- One-Hot Encoding: Converts categorical variables into a binary matrix.
- Zero Variance Removal: Removes predictors with no variance.
- Log Transformation: Applies a log transformation to skewed numerical variables to improve normality.
- Centering and Scaling: Centers and scales numerical predictors to standardize them.
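The four steps listed above can be traced by hand in base R. This sketch shows what each step computes on a toy data frame; it is not the recipe definition used in the report, the values are made up, and only the column names are borrowed from the dataset.

```r
# Made-up rows; column names are borrowed from the dataset, values are not.
toy <- data.frame(
  AMT_CREDIT       = c(270000, 513531, 808650, 45000, 1350000),
  CNT_CHILDREN     = c(0, 1, 0, 2, 0),
  FLAG_MOBIL       = c(1, 1, 1, 1, 1),  # constant column: zero variance
  NAME_INCOME_TYPE = c("Working", "Pensioner", "Working",
                       "State servant", "Working"),
  stringsAsFactors = TRUE
)

# 1. One-hot (dummy) encoding. model.matrix() uses the first factor level as
#    the reference category; the intercept column is dropped afterwards.
X <- model.matrix(~ ., data = toy)[, -1, drop = FALSE]
dummy_cols <- setdiff(colnames(X), names(toy))

# 2. Zero-variance removal.
keep <- apply(X, 2, function(col) length(unique(col)) > 1)
X <- X[, keep, drop = FALSE]
dummy_cols <- intersect(dummy_cols, colnames(X))

# 3. Log transform of a skewed, strictly positive amount.
X[, "AMT_CREDIT"] <- log(X[, "AMT_CREDIT"])

# 4. Center and scale the numeric predictors, leaving 0/1 dummies untouched.
num_cols <- setdiff(colnames(X), dummy_cols)
X[, num_cols] <- scale(X[, num_cols])

print(round(X, 3))
```

In a real pipeline the means and standard deviations from step 4 would be estimated on the training data and then reused unchanged on the test data.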
Apply the Recipe
Before training, we apply the recipes to ensure that preprocessing is correctly applied.
tibble [30,000 × 48] (S3: tbl_df/tbl/data.frame)
$ REG_REGION_NOT_LIVE_REGION : num [1:30000] -0.12 -0.12 -0.12 -0.12 -0.12 ...
$ AMT_ANNUITY : num [1:30000] -0.0613 -1.4863 0.5548 -0.6555 0.5697 ...
$ AMT_CREDIT : num [1:30000] -0.1 -1.072 0.716 0.168 0.634 ...
$ CNT_CHILDREN : num [1:30000] -0.589 -0.589 0.796 -0.589 -0.589 ...
$ AMT_INCOME_TOTAL : num [1:30000] -1.07 0.812 0.812 -1.07 -0.612 ...
$ AMT_GOODS_PRICE : num [1:30000] 0.054 -0.917 0.712 0.054 0.631 ...
$ DAYS_EMPLOYED : num [1:30000] -0.00439 -0.28007 0.09285 1.94306 1.94306 ...
$ DAYS_REGISTRATION : num [1:30000] 1.708 -1.341 0.937 1.617 1.605 ...
$ DAYS_BIRTH : num [1:30000] -0.548 0.621 0.899 1.356 1.296 ...
$ AMT_REQ_CREDIT_BUREAU_HOUR : num [1:30000] -0.0731 -0.0731 -0.0731 -0.0731 -0.0731 ...
$ AMT_REQ_CREDIT_BUREAU_DAY : num [1:30000] -0.0632 -0.0632 -0.0632 -0.0632 -0.0632 ...
$ AMT_REQ_CREDIT_BUREAU_WEEK : num [1:30000] -0.166 -0.166 -0.166 -0.166 -0.166 ...
$ AMT_REQ_CREDIT_BUREAU_MON : num [1:30000] -0.294 -0.294 -0.294 -0.294 -0.294 ...
$ AMT_REQ_CREDIT_BUREAU_QRT : num [1:30000] -0.427 -0.427 2.854 1.214 1.214 ...
$ AMT_REQ_CREDIT_BUREAU_YEAR : num [1:30000] -1.0164 0.5839 0.5839 -0.483 0.0505 ...
$ OBS_30_CNT_SOCIAL_CIRCLE : num [1:30000] -0.18 -0.611 -0.18 1.546 0.252 ...
$ DEF_30_CNT_SOCIAL_CIRCLE : num [1:30000] -0.325 -0.325 1.985 1.985 -0.325 ...
$ OBS_60_CNT_SOCIAL_CIRCLE : num [1:30000] -0.174 -0.609 -0.174 1.565 0.261 ...
$ DEF_60_CNT_SOCIAL_CIRCLE : num [1:30000] -0.278 -0.278 2.581 -0.278 -0.278 ...
$ DAYS_LAST_PHONE_CHANGE : num [1:30000] -2.365 0.378 0.634 0.296 0.29 ...
$ TARGET : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ CODE_GENDER_M : num [1:30000] 0 1 0 0 0 0 0 1 0 0 ...
$ NAME_CONTRACT_TYPE_Revolving.loans : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
$ FLAG_OWN_CAR_Y : num [1:30000] 0 1 0 0 0 0 0 1 0 0 ...
$ FLAG_OWN_REALTY_Y : num [1:30000] 0 0 1 1 1 1 1 1 1 0 ...
$ NAME_INCOME_TYPE_Pensioner : num [1:30000] 0 0 0 1 1 1 0 0 1 1 ...
$ NAME_INCOME_TYPE_State.servant : num [1:30000] 0 0 0 0 0 0 1 0 0 0 ...
$ NAME_INCOME_TYPE_Student : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
$ NAME_INCOME_TYPE_Working : num [1:30000] 1 1 1 0 0 0 0 0 0 0 ...
$ NAME_EDUCATION_TYPE_Higher.education : num [1:30000] 0 0 0 0 1 0 0 0 0 0 ...
$ NAME_EDUCATION_TYPE_Incomplete.higher : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
$ NAME_EDUCATION_TYPE_Lower.secondary : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
$ NAME_EDUCATION_TYPE_Secondary...secondary.special: num [1:30000] 1 1 1 1 0 1 1 1 1 1 ...
$ NAME_FAMILY_STATUS_Married : num [1:30000] 1 0 0 0 0 1 0 1 1 1 ...
$ NAME_FAMILY_STATUS_Separated : num [1:30000] 0 0 1 0 1 0 0 0 0 0 ...
$ NAME_FAMILY_STATUS_Single...not.married : num [1:30000] 0 1 0 1 0 0 1 0 0 0 ...
$ NAME_FAMILY_STATUS_Widow : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
$ NAME_HOUSING_TYPE_House...apartment : num [1:30000] 0 1 1 1 1 1 0 1 1 1 ...
$ NAME_HOUSING_TYPE_Municipal.apartment : num [1:30000] 0 0 0 0 0 0 1 0 0 0 ...
$ NAME_HOUSING_TYPE_Office.apartment : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
$ NAME_HOUSING_TYPE_Rented.apartment : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
$ NAME_HOUSING_TYPE_With.parents : num [1:30000] 1 0 0 0 0 0 0 0 0 0 ...
$ WEEKDAY_APPR_PROCESS_START_MONDAY : num [1:30000] 0 0 1 0 0 1 0 0 1 0 ...
$ WEEKDAY_APPR_PROCESS_START_SATURDAY : num [1:30000] 0 0 0 0 0 0 1 1 0 0 ...
$ WEEKDAY_APPR_PROCESS_START_SUNDAY : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
$ WEEKDAY_APPR_PROCESS_START_THURSDAY : num [1:30000] 0 1 0 0 0 0 0 0 0 1 ...
$ WEEKDAY_APPR_PROCESS_START_TUESDAY : num [1:30000] 1 0 0 0 0 0 0 0 0 0 ...
$ WEEKDAY_APPR_PROCESS_START_WEDNESDAY : num [1:30000] 0 0 0 0 1 0 0 0 0 0 ...
tibble [30,000 × 48] (S3: tbl_df/tbl/data.frame)
$ REG_REGION_NOT_LIVE_REGION : num [1:30000] -0.12 -0.12 -0.12 -0.12 -0.12 ...
$ AMT_ANNUITY : num [1:30000] -0.299 -1.163 0.338 -0.741 0.357 ...
$ AMT_CREDIT : num [1:30000] -0.392 -0.947 0.485 -0.158 0.373 ...
$ CNT_CHILDREN : num [1:30000] -0.589 -0.589 0.796 -0.589 -0.589 ...
$ AMT_INCOME_TOTAL : num [1:30000] -0.8 0.526 0.526 -0.8 -0.579 ...
$ AMT_GOODS_PRICE : num [1:30000] -0.261 -0.866 0.465 -0.261 0.357 ...
$ DAYS_EMPLOYED : num [1:30000] -0.481 -0.467 -0.488 2.131 2.131 ...
$ DAYS_REGISTRATION : num [1:30000] -2.134 1.187 -0.921 -1.979 -1.958 ...
$ DAYS_BIRTH : num [1:30000] 0.598 -0.581 -0.889 -1.417 -1.347 ...
$ AMT_REQ_CREDIT_BUREAU_HOUR : num [1:30000] -0.0731 -0.0731 -0.0731 -0.0731 -0.0731 ...
$ AMT_REQ_CREDIT_BUREAU_DAY : num [1:30000] -0.0632 -0.0632 -0.0632 -0.0632 -0.0632 ...
$ AMT_REQ_CREDIT_BUREAU_WEEK : num [1:30000] -0.166 -0.166 -0.166 -0.166 -0.166 ...
$ AMT_REQ_CREDIT_BUREAU_MON : num [1:30000] -0.294 -0.294 -0.294 -0.294 -0.294 ...
$ AMT_REQ_CREDIT_BUREAU_QRT : num [1:30000] -0.427 -0.427 2.854 1.214 1.214 ...
$ AMT_REQ_CREDIT_BUREAU_YEAR : num [1:30000] -1.0164 0.5839 0.5839 -0.483 0.0505 ...
$ OBS_30_CNT_SOCIAL_CIRCLE : num [1:30000] -0.18 -0.611 -0.18 1.546 0.252 ...
$ DEF_30_CNT_SOCIAL_CIRCLE : num [1:30000] -0.325 -0.325 1.985 1.985 -0.325 ...
$ OBS_60_CNT_SOCIAL_CIRCLE : num [1:30000] -0.174 -0.609 -0.174 1.565 0.261 ...
$ DEF_60_CNT_SOCIAL_CIRCLE : num [1:30000] -0.278 -0.278 2.581 -0.278 -0.278 ...
$ DAYS_LAST_PHONE_CHANGE : num [1:30000] -2.365 0.378 0.634 0.296 0.29 ...
$ TARGET : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ CODE_GENDER_M : num [1:30000] 0 1 0 0 0 0 0 1 0 0 ...
$ NAME_CONTRACT_TYPE_Revolving.loans : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
$ FLAG_OWN_CAR_Y : num [1:30000] 0 1 0 0 0 0 0 1 0 0 ...
$ FLAG_OWN_REALTY_Y : num [1:30000] 0 0 1 1 1 1 1 1 1 0 ...
$ NAME_INCOME_TYPE_Pensioner : num [1:30000] 0 0 0 1 1 1 0 0 1 1 ...
$ NAME_INCOME_TYPE_State.servant : num [1:30000] 0 0 0 0 0 0 1 0 0 0 ...
$ NAME_INCOME_TYPE_Student : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
$ NAME_INCOME_TYPE_Working : num [1:30000] 1 1 1 0 0 0 0 0 0 0 ...
$ NAME_EDUCATION_TYPE_Higher.education : num [1:30000] 0 0 0 0 1 0 0 0 0 0 ...
$ NAME_EDUCATION_TYPE_Incomplete.higher : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
$ NAME_EDUCATION_TYPE_Lower.secondary : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
$ NAME_EDUCATION_TYPE_Secondary...secondary.special: num [1:30000] 1 1 1 1 0 1 1 1 1 1 ...
$ NAME_FAMILY_STATUS_Married : num [1:30000] 1 0 0 0 0 1 0 1 1 1 ...
$ NAME_FAMILY_STATUS_Separated : num [1:30000] 0 0 1 0 1 0 0 0 0 0 ...
$ NAME_FAMILY_STATUS_Single...not.married : num [1:30000] 0 1 0 1 0 0 1 0 0 0 ...
$ NAME_FAMILY_STATUS_Widow : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
$ NAME_HOUSING_TYPE_House...apartment : num [1:30000] 0 1 1 1 1 1 0 1 1 1 ...
$ NAME_HOUSING_TYPE_Municipal.apartment : num [1:30000] 0 0 0 0 0 0 1 0 0 0 ...
$ NAME_HOUSING_TYPE_Office.apartment : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
$ NAME_HOUSING_TYPE_Rented.apartment : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
$ NAME_HOUSING_TYPE_With.parents : num [1:30000] 1 0 0 0 0 0 0 0 0 0 ...
$ WEEKDAY_APPR_PROCESS_START_MONDAY : num [1:30000] 0 0 1 0 0 1 0 0 1 0 ...
$ WEEKDAY_APPR_PROCESS_START_SATURDAY : num [1:30000] 0 0 0 0 0 0 1 1 0 0 ...
$ WEEKDAY_APPR_PROCESS_START_SUNDAY : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
$ WEEKDAY_APPR_PROCESS_START_THURSDAY : num [1:30000] 0 1 0 0 0 0 0 0 0 1 ...
$ WEEKDAY_APPR_PROCESS_START_TUESDAY : num [1:30000] 1 0 0 0 0 0 0 0 0 0 ...
$ WEEKDAY_APPR_PROCESS_START_WEDNESDAY : num [1:30000] 0 0 0 0 1 0 0 0 0 0 ...
Model Workflow
We define and combine the logistic regression model with the preprocessing recipe into a workflow.
k-Nearest Neighbors (KNN)
Recipe Definition
For KNN, we create a recipe that includes:
- One-Hot Encoding: Converts categorical variables.
- Normalization: Normalizes numerical predictors to a standard range.
- Centering and Scaling: Centers and scales numerical predictors to ensure they contribute equally to the distance calculations in KNN.
Apply the Recipe
Apply the KNN recipes to ensure all preprocessing steps are correctly implemented.
tibble [30,000 × 48] (S3: tbl_df/tbl/data.frame)
$ REG_REGION_NOT_LIVE_REGION : num [1:30000] -0.122 -0.122 -0.122 -0.122 -0.122 ...
$ AMT_ANNUITY : num [1:30000] -1.505 0.555 0.57 -0.259 0.37 ...
$ AMT_CREDIT : num [1:30000] -1.078 0.738 0.656 0.222 -0.268 ...
$ CNT_CHILDREN : num [1:30000] -0.59 0.77 -0.59 -0.59 -0.59 ...
$ AMT_INCOME_TOTAL : num [1:30000] 0.827 0.827 -0.62 -1.578 0.827 ...
$ AMT_GOODS_PRICE : num [1:30000] -0.914 0.741 0.658 0.128 -0.281 ...
$ DAYS_EMPLOYED : num [1:30000] -0.237 0.142 2.021 2.021 -0.228 ...
$ DAYS_REGISTRATION : num [1:30000] -1.316 0.966 1.635 1.306 -0.886 ...
$ DAYS_BIRTH : num [1:30000] 0.673 0.949 1.343 1.221 0.682 ...
$ AMT_REQ_CREDIT_BUREAU_HOUR : num [1:30000] -0.0718 -0.0718 -0.0718 -0.0718 -0.0718 ...
$ AMT_REQ_CREDIT_BUREAU_DAY : num [1:30000] -0.0698 -0.0698 -0.0698 -0.0698 -0.0698 ...
$ AMT_REQ_CREDIT_BUREAU_WEEK : num [1:30000] -0.157 -0.157 -0.157 -0.157 -0.157 ...
$ AMT_REQ_CREDIT_BUREAU_MON : num [1:30000] -0.295 -0.295 -0.295 -0.295 -0.295 ...
$ AMT_REQ_CREDIT_BUREAU_QRT : num [1:30000] -0.42 2.87 1.22 -0.42 -0.42 ...
$ AMT_REQ_CREDIT_BUREAU_YEAR : num [1:30000] 0.577 0.577 0.043 1.112 -0.491 ...
$ OBS_30_CNT_SOCIAL_CIRCLE : num [1:30000] -0.619 -0.194 0.231 -0.619 1.082 ...
$ DEF_30_CNT_SOCIAL_CIRCLE : num [1:30000] -0.338 1.888 -0.338 -0.338 1.888 ...
$ OBS_60_CNT_SOCIAL_CIRCLE : num [1:30000] -0.617 -0.189 0.24 -0.617 1.098 ...
$ DEF_60_CNT_SOCIAL_CIRCLE : num [1:30000] -0.291 2.451 -0.291 -0.291 2.451 ...
$ DAYS_LAST_PHONE_CHANGE : num [1:30000] 0.35 0.608 0.261 -0.286 0.422 ...
$ TARGET : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ CODE_GENDER_M : num [1:30000] 1 0 0 0 0 1 0 0 1 1 ...
$ NAME_CONTRACT_TYPE_Revolving.loans : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
$ FLAG_OWN_CAR_Y : num [1:30000] 1 0 0 0 0 1 0 0 0 1 ...
$ FLAG_OWN_REALTY_Y : num [1:30000] 0 1 1 1 1 1 1 0 0 1 ...
$ NAME_INCOME_TYPE_Pensioner : num [1:30000] 0 0 1 1 0 0 1 1 0 0 ...
$ NAME_INCOME_TYPE_State.servant : num [1:30000] 0 0 0 0 1 0 0 0 0 0 ...
$ NAME_INCOME_TYPE_Student : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
$ NAME_INCOME_TYPE_Working : num [1:30000] 1 1 0 0 0 0 0 0 1 1 ...
$ NAME_EDUCATION_TYPE_Higher.education : num [1:30000] 0 0 1 0 0 0 0 0 0 0 ...
$ NAME_EDUCATION_TYPE_Incomplete.higher : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
$ NAME_EDUCATION_TYPE_Lower.secondary : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
$ NAME_EDUCATION_TYPE_Secondary...secondary.special: num [1:30000] 1 1 0 1 1 1 1 1 1 1 ...
$ NAME_FAMILY_STATUS_Married : num [1:30000] 0 0 0 1 0 1 1 1 0 1 ...
$ NAME_FAMILY_STATUS_Separated : num [1:30000] 0 1 1 0 0 0 0 0 1 0 ...
$ NAME_FAMILY_STATUS_Single...not.married : num [1:30000] 1 0 0 0 1 0 0 0 0 0 ...
$ NAME_FAMILY_STATUS_Widow : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
$ NAME_HOUSING_TYPE_House...apartment : num [1:30000] 1 1 1 1 0 1 1 1 1 1 ...
$ NAME_HOUSING_TYPE_Municipal.apartment : num [1:30000] 0 0 0 0 1 0 0 0 0 0 ...
$ NAME_HOUSING_TYPE_Office.apartment : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
$ NAME_HOUSING_TYPE_Rented.apartment : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
$ NAME_HOUSING_TYPE_With.parents : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
$ WEEKDAY_APPR_PROCESS_START_MONDAY : num [1:30000] 0 1 0 1 0 0 1 0 0 0 ...
$ WEEKDAY_APPR_PROCESS_START_SATURDAY : num [1:30000] 0 0 0 0 1 1 0 0 0 0 ...
$ WEEKDAY_APPR_PROCESS_START_SUNDAY : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
$ WEEKDAY_APPR_PROCESS_START_THURSDAY : num [1:30000] 1 0 0 0 0 0 0 1 0 1 ...
$ WEEKDAY_APPR_PROCESS_START_TUESDAY : num [1:30000] 0 0 0 0 0 0 0 0 1 0 ...
$ WEEKDAY_APPR_PROCESS_START_WEDNESDAY : num [1:30000] 0 0 1 0 0 0 0 0 0 0 ...
tibble [30,000 × 48] (S3: tbl_df/tbl/data.frame)
$ REG_REGION_NOT_LIVE_REGION : num [1:30000] -0.122 -0.122 -0.122 -0.122 -0.122 ...
$ AMT_ANNUITY : num [1:30000] -1.19 0.348 0.367 -0.465 0.131 ...
$ AMT_CREDIT : num [1:30000] -0.9487 0.5193 0.4048 -0.0999 -0.5129 ...
$ CNT_CHILDREN : num [1:30000] -0.59 0.77 -0.59 -0.59 -0.59 ...
$ AMT_INCOME_TOTAL : num [1:30000] 0.547 0.547 -0.583 -0.999 0.547 ...
$ AMT_GOODS_PRICE : num [1:30000] -0.863 0.507 0.395 -0.191 -0.515 ...
$ DAYS_EMPLOYED : num [1:30000] -0.449 -0.47 2.222 2.222 -0.449 ...
$ DAYS_REGISTRATION : num [1:30000] 1.17 -0.958 -2.005 -1.466 0.939 ...
$ DAYS_BIRTH : num [1:30000] -0.635 -0.942 -1.398 -1.254 -0.645 ...
$ AMT_REQ_CREDIT_BUREAU_HOUR : num [1:30000] -0.0718 -0.0718 -0.0718 -0.0718 -0.0718 ...
$ AMT_REQ_CREDIT_BUREAU_DAY : num [1:30000] -0.0698 -0.0698 -0.0698 -0.0698 -0.0698 ...
$ AMT_REQ_CREDIT_BUREAU_WEEK : num [1:30000] -0.157 -0.157 -0.157 -0.157 -0.157 ...
$ AMT_REQ_CREDIT_BUREAU_MON : num [1:30000] -0.295 -0.295 -0.295 -0.295 -0.295 ...
$ AMT_REQ_CREDIT_BUREAU_QRT : num [1:30000] -0.42 2.87 1.22 -0.42 -0.42 ...
$ AMT_REQ_CREDIT_BUREAU_YEAR : num [1:30000] 0.577 0.577 0.043 1.112 -0.491 ...
$ OBS_30_CNT_SOCIAL_CIRCLE : num [1:30000] -0.619 -0.194 0.231 -0.619 1.082 ...
$ DEF_30_CNT_SOCIAL_CIRCLE : num [1:30000] -0.338 1.888 -0.338 -0.338 1.888 ...
$ OBS_60_CNT_SOCIAL_CIRCLE : num [1:30000] -0.617 -0.189 0.24 -0.617 1.098 ...
$ DEF_60_CNT_SOCIAL_CIRCLE : num [1:30000] -0.291 2.451 -0.291 -0.291 2.451 ...
$ DAYS_LAST_PHONE_CHANGE : num [1:30000] 0.35 0.608 0.261 -0.286 0.422 ...
$ TARGET : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ CODE_GENDER_M : num [1:30000] 1 0 0 0 0 1 0 0 1 1 ...
$ NAME_CONTRACT_TYPE_Revolving.loans : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
$ FLAG_OWN_CAR_Y : num [1:30000] 1 0 0 0 0 1 0 0 0 1 ...
$ FLAG_OWN_REALTY_Y : num [1:30000] 0 1 1 1 1 1 1 0 0 1 ...
$ NAME_INCOME_TYPE_Pensioner : num [1:30000] 0 0 1 1 0 0 1 1 0 0 ...
$ NAME_INCOME_TYPE_State.servant : num [1:30000] 0 0 0 0 1 0 0 0 0 0 ...
$ NAME_INCOME_TYPE_Student : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
$ NAME_INCOME_TYPE_Working : num [1:30000] 1 1 0 0 0 0 0 0 1 1 ...
$ NAME_EDUCATION_TYPE_Higher.education : num [1:30000] 0 0 1 0 0 0 0 0 0 0 ...
$ NAME_EDUCATION_TYPE_Incomplete.higher : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
$ NAME_EDUCATION_TYPE_Lower.secondary : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
$ NAME_EDUCATION_TYPE_Secondary...secondary.special: num [1:30000] 1 1 0 1 1 1 1 1 1 1 ...
$ NAME_FAMILY_STATUS_Married : num [1:30000] 0 0 0 1 0 1 1 1 0 1 ...
$ NAME_FAMILY_STATUS_Separated : num [1:30000] 0 1 1 0 0 0 0 0 1 0 ...
$ NAME_FAMILY_STATUS_Single...not.married : num [1:30000] 1 0 0 0 1 0 0 0 0 0 ...
$ NAME_FAMILY_STATUS_Widow : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
$ NAME_HOUSING_TYPE_House...apartment : num [1:30000] 1 1 1 1 0 1 1 1 1 1 ...
$ NAME_HOUSING_TYPE_Municipal.apartment : num [1:30000] 0 0 0 0 1 0 0 0 0 0 ...
$ NAME_HOUSING_TYPE_Office.apartment : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
$ NAME_HOUSING_TYPE_Rented.apartment : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
$ NAME_HOUSING_TYPE_With.parents : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
$ WEEKDAY_APPR_PROCESS_START_MONDAY : num [1:30000] 0 1 0 1 0 0 1 0 0 0 ...
$ WEEKDAY_APPR_PROCESS_START_SATURDAY : num [1:30000] 0 0 0 0 1 1 0 0 0 0 ...
$ WEEKDAY_APPR_PROCESS_START_SUNDAY : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
$ WEEKDAY_APPR_PROCESS_START_THURSDAY : num [1:30000] 1 0 0 0 0 0 0 1 0 1 ...
$ WEEKDAY_APPR_PROCESS_START_TUESDAY : num [1:30000] 0 0 0 0 0 0 0 0 1 0 ...
$ WEEKDAY_APPR_PROCESS_START_WEDNESDAY : num [1:30000] 0 0 1 0 0 0 0 0 0 0 ...
Model Workflow
We define the KNN model, set up a parameter grid for tuning, and combine it with the preprocessing recipe.
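To illustrate what tuning over a parameter grid involves, the sketch below scores a few candidate neighbor counts on a held-out validation set. Everything in it is an assumption made for illustration: the data is simulated, the grid values are hypothetical, the KNN and ROC AUC helpers are hand-written, and a single validation split is used for brevity where a real workflow would more likely use cross-validation.

```r
set.seed(7)

# Made-up two-feature data: class 1 is shifted relative to class 0.
make_data <- function(n) {
  y <- rbinom(n, size = 1, prob = 0.3)
  data.frame(x1 = rnorm(n, mean = y), x2 = rnorm(n, mean = y), y = y)
}
train <- make_data(300)
valid <- make_data(150)

# Share of positive labels among the k nearest training points (Euclidean).
knn_prob <- function(train_x, train_y, new_x, k) {
  apply(new_x, 1, function(point) {
    d <- sqrt(rowSums(sweep(train_x, 2, point)^2))
    mean(train_y[order(d)[seq_len(k)]])
  })
}

# Rank-based ROC AUC.
roc_auc <- function(y, score) {
  r <- rank(score)
  n_pos <- sum(y == 1)
  n_neg <- sum(y == 0)
  (sum(r[y == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical grid of neighbor counts; score each on the validation set.
grid <- data.frame(neighbors = c(1, 5, 15, 31))
train_x <- as.matrix(train[, c("x1", "x2")])
valid_x <- as.matrix(valid[, c("x1", "x2")])
grid$roc_auc <- sapply(grid$neighbors, function(k) {
  roc_auc(valid$y, knn_prob(train_x, train$y, valid_x, k))
})

print(grid)
best_k <- grid$neighbors[which.max(grid$roc_auc)]
print(best_k)
```

The candidate with the highest validation ROC AUC is kept as `best_k`; the same select-by-metric pattern applies to any other tuned parameter.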
Random Forest (RF)
Recipe Definition
For RF, we use a similar recipe to logistic regression with log transformations, centering, and scaling of numerical predictors.
tibble [30,000 × 48] (S3: tbl_df/tbl/data.frame)
$ REG_REGION_NOT_LIVE_REGION : num [1:30000] -0.122 -0.122 -0.122 -0.122 -0.122 ...
$ AMT_ANNUITY : num [1:30000] -1.505 0.555 0.57 -0.259 0.37 ...
$ AMT_CREDIT : num [1:30000] -1.078 0.738 0.656 0.222 -0.268 ...
$ CNT_CHILDREN : num [1:30000] -0.59 0.77 -0.59 -0.59 -0.59 ...
$ AMT_INCOME_TOTAL : num [1:30000] 0.827 0.827 -0.62 -1.578 0.827 ...
$ AMT_GOODS_PRICE : num [1:30000] -0.914 0.741 0.658 0.128 -0.281 ...
$ DAYS_EMPLOYED : num [1:30000] -0.237 0.142 2.021 2.021 -0.228 ...
$ DAYS_REGISTRATION : num [1:30000] -1.316 0.966 1.635 1.306 -0.886 ...
$ DAYS_BIRTH : num [1:30000] 0.673 0.949 1.343 1.221 0.682 ...
$ AMT_REQ_CREDIT_BUREAU_HOUR : num [1:30000] -0.0718 -0.0718 -0.0718 -0.0718 -0.0718 ...
$ AMT_REQ_CREDIT_BUREAU_DAY : num [1:30000] -0.0698 -0.0698 -0.0698 -0.0698 -0.0698 ...
$ AMT_REQ_CREDIT_BUREAU_WEEK : num [1:30000] -0.157 -0.157 -0.157 -0.157 -0.157 ...
$ AMT_REQ_CREDIT_BUREAU_MON : num [1:30000] -0.295 -0.295 -0.295 -0.295 -0.295 ...
$ AMT_REQ_CREDIT_BUREAU_QRT : num [1:30000] -0.42 2.87 1.22 -0.42 -0.42 ...
$ AMT_REQ_CREDIT_BUREAU_YEAR : num [1:30000] 0.577 0.577 0.043 1.112 -0.491 ...
$ OBS_30_CNT_SOCIAL_CIRCLE : num [1:30000] -0.619 -0.194 0.231 -0.619 1.082 ...
$ DEF_30_CNT_SOCIAL_CIRCLE : num [1:30000] -0.338 1.888 -0.338 -0.338 1.888 ...
$ OBS_60_CNT_SOCIAL_CIRCLE : num [1:30000] -0.617 -0.189 0.24 -0.617 1.098 ...
$ DEF_60_CNT_SOCIAL_CIRCLE : num [1:30000] -0.291 2.451 -0.291 -0.291 2.451 ...
$ DAYS_LAST_PHONE_CHANGE : num [1:30000] 0.35 0.608 0.261 -0.286 0.422 ...
$ TARGET : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ CODE_GENDER_M : num [1:30000] 1 0 0 0 0 1 0 0 1 1 ...
$ NAME_CONTRACT_TYPE_Revolving.loans : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
$ FLAG_OWN_CAR_Y : num [1:30000] 1 0 0 0 0 1 0 0 0 1 ...
$ FLAG_OWN_REALTY_Y : num [1:30000] 0 1 1 1 1 1 1 0 0 1 ...
$ NAME_INCOME_TYPE_Pensioner : num [1:30000] 0 0 1 1 0 0 1 1 0 0 ...
$ NAME_INCOME_TYPE_State.servant : num [1:30000] 0 0 0 0 1 0 0 0 0 0 ...
$ NAME_INCOME_TYPE_Student : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
$ NAME_INCOME_TYPE_Working : num [1:30000] 1 1 0 0 0 0 0 0 1 1 ...
$ NAME_EDUCATION_TYPE_Higher.education : num [1:30000] 0 0 1 0 0 0 0 0 0 0 ...
$ NAME_EDUCATION_TYPE_Incomplete.higher : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
$ NAME_EDUCATION_TYPE_Lower.secondary : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
$ NAME_EDUCATION_TYPE_Secondary...secondary.special: num [1:30000] 1 1 0 1 1 1 1 1 1 1 ...
$ NAME_FAMILY_STATUS_Married : num [1:30000] 0 0 0 1 0 1 1 1 0 1 ...
$ NAME_FAMILY_STATUS_Separated : num [1:30000] 0 1 1 0 0 0 0 0 1 0 ...
$ NAME_FAMILY_STATUS_Single...not.married : num [1:30000] 1 0 0 0 1 0 0 0 0 0 ...
$ NAME_FAMILY_STATUS_Widow : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
$ NAME_HOUSING_TYPE_House...apartment : num [1:30000] 1 1 1 1 0 1 1 1 1 1 ...
$ NAME_HOUSING_TYPE_Municipal.apartment : num [1:30000] 0 0 0 0 1 0 0 0 0 0 ...
$ NAME_HOUSING_TYPE_Office.apartment : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
$ NAME_HOUSING_TYPE_Rented.apartment : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
$ NAME_HOUSING_TYPE_With.parents : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
$ WEEKDAY_APPR_PROCESS_START_MONDAY : num [1:30000] 0 1 0 1 0 0 1 0 0 0 ...
$ WEEKDAY_APPR_PROCESS_START_SATURDAY : num [1:30000] 0 0 0 0 1 1 0 0 0 0 ...
$ WEEKDAY_APPR_PROCESS_START_SUNDAY : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
$ WEEKDAY_APPR_PROCESS_START_THURSDAY : num [1:30000] 1 0 0 0 0 0 0 1 0 1 ...
$ WEEKDAY_APPR_PROCESS_START_TUESDAY : num [1:30000] 0 0 0 0 0 0 0 0 1 0 ...
$ WEEKDAY_APPR_PROCESS_START_WEDNESDAY : num [1:30000] 0 0 1 0 0 0 0 0 0 0 ...
Model Workflow
We define the RF model, set up a parameter grid for tuning, and combine it with the preprocessing recipe.
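As a hedged illustration, the RF specification and its tuning grid might be set up as follows. The `ranger` engine, the number of trees, and the parameter ranges are assumptions made for this sketch, and `rf_recipe` is a hypothetical name for the recipe described above.

```r
library(tidymodels)  # assumes tidymodels is installed; ranger is needed only when fitting

# Random forest specification; mtry and min_n are marked for tuning
rf_spec <- rand_forest(mtry = tune(), min_n = tune(), trees = 500) |>
  set_engine("ranger") |>
  set_mode("classification")

# Regular grid: 3 levels per parameter gives 9 candidate combinations.
# mtry has no default upper bound, so its range is given explicitly.
rf_grid <- grid_regular(
  mtry(range = c(2L, 10L)),
  min_n(range = c(2L, 40L)),
  levels = 3
)

# rf_recipe is a hypothetical name for the RF preprocessing recipe:
# rf_wf <- workflow() |> add_recipe(rf_recipe) |> add_model(rf_spec)
```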
Fine-Tuning and Training Models
We’ll proceed by fine-tuning and training our models: logistic regression, k-Nearest Neighbors (KNN), and random forest (RF). For each model, we’ll use cross-validation, perform a grid search for hyperparameter tuning (where applicable), and save the results. Finally, we’ll analyze the performance metrics for each model, including confusion matrices, to assess their effectiveness on the test data.
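The resampling step can be sketched as below, using logistic regression because it needs no tuning. The toy data, the fold count, and the object names are assumptions for illustration only; for a tunable model, `fit_resamples()` would be swapped for `tune_grid()` with a `grid` argument.

```r
library(tidymodels)  # assumes tidymodels is installed

set.seed(42)
# Hypothetical imbalanced toy data standing in for the training set
n  <- 400
x1 <- rnorm(n)
x2 <- rnorm(n)
toy_train <- tibble(
  TARGET = factor(rbinom(n, 1, plogis(-2 + x1)), levels = c("0", "1")),
  x1 = x1,
  x2 = x2
)

# 5-fold cross-validation, stratified so each fold keeps the class ratio
folds <- vfold_cv(toy_train, v = 5, strata = TARGET)

lr_wf <- workflow() |>
  add_recipe(recipe(TARGET ~ ., data = toy_train)) |>
  add_model(logistic_reg() |> set_engine("glm"))

# Fit on every fold and record both metrics on the held-out part
lr_res <- fit_resamples(
  lr_wf,
  resamples = folds,
  metrics   = metric_set(accuracy, roc_auc)
)

# One row per metric, with the fold average (mean) and its standard error (std_err)
cv_metrics <- collect_metrics(lr_res)
cv_metrics
```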
Logistic Regression, KNN, and Random Forest
Let’s begin fine-tuning!
| Model | Metric | Mean | Standard Error |
|---|---|---|---|
| Transformed | accuracy | 0.7403001 | 0.0011883 |
| Transformed | roc_auc | 0.6592071 | 0.0037212 |
| Regular | accuracy | 0.7405333 | 0.0009099 |
| Regular | roc_auc | 0.6610435 | 0.0033060 |

| Model | Metric | Estimate |
|---|---|---|
| Transformed | accuracy | 0.8556667 |
| Transformed | kap | 0.0194780 |
| No Transformation | accuracy | 0.8575000 |
| No Transformation | kap | 0.0291659 |

| Model | Metric | Estimate |
|---|---|---|
| Transformed | accuracy | 0.9195 |
| Transformed | kap | 0.0000 |
Analyzing Model Performance
After training the models, we analyze their performance using the metrics collected during cross-validation and the predictions on the test dataset. We determine the best model by considering the trade-offs between false positives and false negatives, together with each model’s ROC AUC.
False Positives (FP)
- Definition: False positives occur when the model predicts that a borrower will default on their loan (i.e., TARGET = 1), but in reality, the borrower does not default.
- Impact:
  - Financial Cost: The model incorrectly identifies a credit-worthy borrower as a risk, which could lead to an unnecessary denial of credit. For the lender, this means lost opportunities and lost potential revenue.
  - Customer Experience: Borrowers who are incorrectly labeled as high-risk might experience frustration or inconvenience if they are denied credit or face higher interest rates.
False Negatives (FN)
- Definition: False negatives occur when the model predicts that a borrower will not default on their loan (i.e., TARGET = 0), but in reality, the borrower does default.
- Impact:
  - Financial Risk: The model fails to identify a high-risk borrower, potentially leading to financial losses due to defaults that could have been anticipated and mitigated.
  - Risk Management: The lender might face higher-than-expected default rates, which can affect profitability and increase the need for more stringent risk management strategies.
Balancing False Positives and False Negatives
In credit risk prediction, it’s crucial to balance false positives and false negatives:
- Minimizing False Positives: Reducing false positives helps in approving more credit-worthy applicants. However, if reduced too much, it might lead to increased false negatives.
- Minimizing False Negatives: Reducing false negatives ensures that potential defaults are caught early, but if too aggressive, it might result in higher false positives.
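The trade-off can be made concrete by counting both error types at several decision thresholds. The sketch below uses simulated scores rather than output from any of the models in this report; `count_errors` and the threshold values are hypothetical choices for illustration.

```r
set.seed(7)
# Hypothetical predicted default probabilities and true outcomes (1 = default)
n      <- 1000
actual <- rbinom(n, 1, 0.08)
# Defaulters tend to get higher scores; this is simulated, not model output
score  <- plogis(rnorm(n, mean = ifelse(actual == 1, -1.5, -2.5), sd = 1))

# Count false positives and false negatives at a given decision threshold
count_errors <- function(threshold) {
  predicted <- as.integer(score >= threshold)
  c(threshold = threshold,
    FP = sum(predicted == 1 & actual == 0),  # flagged, but did not default
    FN = sum(predicted == 0 & actual == 1))  # missed default
}

# One row per threshold, from lenient to strict
errors <- t(sapply(c(0.05, 0.10, 0.20, 0.50), count_errors))
errors
```

Raising the threshold can only move borrowers from "predicted default" to "predicted no default", so the FP column never increases and the FN column never decreases as you read down the rows.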
Let’s analyze the confusion matrix of each model after it is fit to the training data. This is not common practice, and much of this computation could have been avoided by examining the ROC AUC and accuracy metrics beforehand. Still, I believe it was important in this case to get the full picture, because high accuracy may not mean a successful model here. For example, a model that always guesses the majority class is guaranteed an accuracy above 90% on any representative sample of our population. By analyzing the confusion matrices first, we can make a better-informed decision when ultimately declaring a model the winner.
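A short worked example shows why accuracy alone is misleading here. The counts are illustrative (2,000 borrowers with an 8.05% default rate), not taken from the actual test set, and `cohen_kappa` is a name introduced for this sketch.

```r
# Illustrative labels: 2000 borrowers, 161 of whom default (8.05%)
truth <- factor(c(rep("0", 1839), rep("1", 161)), levels = c("0", "1"))
# A "model" that always predicts the majority class
pred  <- factor(rep("0", length(truth)), levels = c("0", "1"))

cm <- table(Predicted = pred, Actual = truth)

accuracy <- sum(diag(cm)) / sum(cm)

# Cohen's kappa: agreement beyond what the marginal frequencies alone produce
p_observed  <- accuracy
p_expected  <- sum(rowSums(cm) * colSums(cm)) / sum(cm)^2
cohen_kappa <- (p_observed - p_expected) / (1 - p_expected)

accuracy     # 0.9195
cohen_kappa  # 0
```

An accuracy of 0.9195 paired with a kappa of 0 is therefore the signature of a classifier that never predicts the minority class.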
Final Model Evaluation
After experimenting with multiple models, we determined that the K-Nearest Neighbors (KNN) model without transformations surprisingly outperformed other models like Logistic Regression and Random Forest. While KNN’s success might seem counterintuitive given its sensitivity to unscaled data and outliers, it performed best in terms of the ROC AUC metric. However, despite being the best-performing model from the bunch, there are several important considerations regarding its actual effectiveness.
1. ROC AUC Performance and Interpretation:
- The ROC AUC for the KNN model was surprisingly low, and the ROC curve appeared below the y = x line. This is significant because the y = x line represents a random classifier (i.e., a model with no discriminative power, where the TPR equals the FPR). When the ROC curve falls below this line, it suggests that the model is performing worse than random guessing.
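- Why this happened: This outcome indicates that the model might be systematically predicting the opposite class or is heavily skewed by imbalanced data, leading to poor generalization on the test data. Another possibility worth ruling out is that the curve was computed with the class labels reversed, for example by scoring the predicted probability of one class while treating the other class as the event; in that case the curve is mirrored about the y = x line. Despite achieving some level of performance during training and cross-validation, the model struggles to differentiate between the positive and negative classes in real-world (testing) scenarios.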
2. Confusion Matrix Analysis:
- The confusion matrix for the KNN model reveals that the model has difficulty identifying the positive (defaulting) class. The true positives (correctly predicted defaults) are low, while the false negatives (missed defaults) are high. This is also the case for the other models; however, KNN and Logistic Regression at least did not collapse to predicting only the non-defaulting class.
- False Positives and False Negatives: In this context, the high number of false negatives is particularly concerning because it means the model is failing to identify borrowers who will actually default on their loans. This could result in significant financial risks if deployed in a real-world setting.
3. Baseline Model Comparison:
- It’s essential to compare the KNN model to a baseline or null model. A baseline model could simply predict the majority class (e.g., predicting all borrowers as non-defaulting). If our KNN model’s performance is not substantially better than this baseline, the effort involved in building and tuning the model may not be justified.
- Does the effort pay off?: Given the low ROC AUC and the model’s difficulty in identifying the defaulting class, the predictive power of the KNN model may not be worth the complexity and effort invested. In this case, a simpler approach such as a rule-based system may be preferable. Another option is a regularized machine learning model that penalizes errors on the minority class two or three times as heavily as errors on the majority class, so that it is not overwhelmed by the class imbalance and places greater emphasis on predicting the minority class correctly.
4. Challenges with Imbalanced Data:
- Even though we applied techniques like upsampling to close the gap between the minority and majority classes, the Random Forest model (and to some extent, the other models) was still overwhelmed by the majority class. This underscores the challenge of dealing with imbalanced data, where traditional machine learning algorithms can struggle to learn meaningful patterns from the minority class.
- Nonlinearities and Complexity: Random Forest, being a more complex model that handles nonlinearities well, should theoretically be better at capturing intricate relationships. However, the imbalance in the dataset, and perhaps overfitting to the majority class, might have hindered its performance. This suggests that more advanced techniques and more sophisticated resampling strategies might be necessary.
5. ROC Curve and Performance Visualization:
- Plotting the ROC curves for all models shows that the KNN model is not alone in performing poorly: the curves for the other models also struggle to stay above the y = x line. This indicates that all models have difficulty distinguishing between the positive and negative classes, with little improvement over random guessing.
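To make the ROC AUC discussion concrete, the sketch below computes AUC directly from ranks and shows what happens when the score for the wrong class is used. The data are simulated, and `auc_rank`, `p_class1`, and `p_class0` are hypothetical names, not objects from this analysis.

```r
# AUC as the probability that a randomly chosen positive outranks a
# randomly chosen negative (Mann-Whitney form); ties count as one half.
auc_rank <- function(score, actual) {
  r     <- rank(score)                 # average ranks handle ties
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

set.seed(11)
actual   <- rbinom(500, 1, 0.1)
# Simulated P(default): defaulters tend to receive higher values
p_class1 <- plogis(rnorm(500, mean = ifelse(actual == 1, -1, -2)))
p_class0 <- 1 - p_class1

auc_correct  <- auc_rank(p_class1, actual)  # score matches the event class
auc_reversed <- auc_rank(p_class0, actual)  # wrong column: curve mirrors below y = x

c(correct = auc_correct, reversed = auc_reversed)
```

The two values sum to 1, so an AUC well below 0.5 carries the same ranking information as its mirror image above 0.5. A curve under the diagonal is therefore worth checking for a label or column mix-up before concluding that the model is worse than random.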
Conclusion
Best of the Models, but Not Enough: The KNN model without transformations performed best in our testing, but it still struggles to identify the defaulting class accurately. The low ROC AUC and the confusion matrix results indicate that this model may not be reliable enough for real-world deployment.
Key Learnings: The imbalance in the dataset, model complexity, and the challenges of proper resampling all contributed to the difficulties faced by our models. Future work should consider more advanced resampling techniques, class-weighted models, or other approaches tailored to handling imbalanced data effectively.
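As one possible starting point for the class-weighted direction, here is a minimal base R sketch using case weights in logistic regression. The data are simulated, the weight of 10 is an arbitrary illustrative choice rather than a tuned value, and `recall` is a helper defined only for this example.

```r
set.seed(2024)
# Simulated imbalanced data; x is a single hypothetical predictor
n   <- 5000
x   <- rnorm(n)
y   <- rbinom(n, 1, plogis(-3 + 1.2 * x))
dat <- data.frame(x = x, y = y)

# Integer case weights: each default counts 10 times as much as a non-default
dat$w <- ifelse(dat$y == 1, 10L, 1L)

fit_plain    <- glm(y ~ x, family = binomial, data = dat)
fit_weighted <- glm(y ~ x, family = binomial, data = dat, weights = w)

# Recall on the default class at the usual 0.5 threshold
recall <- function(fit) {
  pred <- as.integer(predict(fit, type = "response") >= 0.5)
  sum(pred == 1 & dat$y == 1) / sum(dat$y == 1)
}

c(plain = recall(fit_plain), weighted = recall(fit_weighted))
```

Weighting raises recall on the defaulting class at the cost of more false positives, so the weight should be chosen with the relative cost of the two error types in mind.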